diff --git a/test/antithesis/scratchbook/bug-ledger.md b/test/antithesis/scratchbook/bug-ledger.md
new file mode 100644
index 00000000000..75cd0b8ce39
--- /dev/null
+++ b/test/antithesis/scratchbook/bug-ledger.md
@@ -0,0 +1,74 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space — headline guarantees that frame these defects as bugs.
+  - path: https://github.com/DataDog/saluki/pull/1768
+    why: PR review #4393897611 / Codex P2 flagged the now-stale aggregate panic repro reconciled here.
+---
+
+# Bug Ledger
+
+Accounting of every defect discovered during `antithesis-research`, with the vehicle used to
+demonstrate it. Goal: burn each discovered bug into a local repro case or an Antithesis triage shot.
+**No fixes were applied.** Following TDD, each local repro asserts the *desired* invariant, so it
+**currently FAILS** against the buggy code — the failing test *is* the demonstration. It turns green
+once the defect is fixed (a ready-made regression guard). Each test's comment describes why/how the
+bug happens.
+
+## Burned into local repro cases (failing unit tests)
+
+| # | Property | Test | Defect |
+|---|----------|------|--------|
+| 1 | `ddsketch-no-nan-poison` | `lib/ddsketch/src/agent/sketch.rs::tests::bug_nan_sample_poisons_sum_and_avg` | One NaN sample permanently poisons sketch `sum`/`avg` (sticky) while `count`/`min`/`max` stay valid → silent corruption; there is no finiteness guard at the sketch boundary. |
+| 2 | `replay-corruption-not-silent-eof` | `lib/saluki-components/.../sources/dogstatsd/replay/reader.rs::tests::bug_corrupt_length_prefix_silently_drops_following_records` | A corrupt/oversized length prefix is read as clean EOF → all following well-formed records silently dropped (false replay fidelity / data loss). |
+| 3 | `aggregate-clock-skew-stable` (forward-jump facet) | `lib/saluki-components/.../transforms/aggregate/mod.rs::tests::bug_forward_clock_jump_floods_zero_value_points` | No `current_time >= last_flush` guard; a forward wall-clock jump builds `zero_value_buckets` over the whole interval — O(jump) work/alloc — and floods one idle counter with points proportional to the jump. |
+| 4 | `rss-bounded-under-cardinality` / `interner-full-bounded` (root cause) | `lib/saluki-context/src/resolver.rs::tests::bug_default_heap_fallback_makes_context_resolution_unbounded` | `allow_heap_allocations` defaults to `true` → a full fixed-size interner silently spills to the heap and `resolve` never refuses → unbounded memory under a high-cardinality flood; the only bounded mode (heap disallowed) silently drops. |
+| 5 | `config-stall-no-deadlock` | `lib/saluki-config/src/lib.rs::tests::bug_config_ready_hangs_forever_without_snapshot` | `GenericConfiguration::ready()` awaits the first dynamic snapshot with no internal timeout; with the sender held open and silent, it never resolves → ADP startup hangs forever. |
+
+Run all five (expect five FAILURES — the failing tests are the demonstrations):
+`cargo nextest run --no-fail-fast -E 'test(/bug_nan_sample_poisons_sum_and_avg|bug_corrupt_length_prefix_silently_drops_following_records|bug_forward_clock_jump_floods_zero_value_points|bug_default_heap_fallback_makes_context_resolution_unbounded|bug_config_ready_hangs_forever_without_snapshot/)'`
+
+## Resolved upstream on main (repro now stale)
+
+- **`aggregate-no-panic-any-window` — sub-second window `% 0` panic (was bug #1).** Fixed on main:
+  the config key is renamed `aggregate_window_duration_seconds` and typed `NonZeroU64`, and
+  `align_to_bucket_start` divides by `bucket_width_secs.get()` end-to-end
+  (`transforms/aggregate/mod.rs:95-98,822-823`), so a zero/sub-second window now fails config
+  parsing instead of reaching the divisor (PR #1772). The repro
+  `tests::bug_sub_second_aggregate_window_panics_on_insert` lives in a **sibling stack commit**
+  (`chore(agent-data-plane): failing repros for six discovered bugs`), not this docs commit. **Action
+  required there (out of scope for this commit):** delete that test or convert it to a passing
+  regression guard, since the desired invariant now holds. The catalog entry is reframed as a
+  low-cost `Unreachable` regression tripwire.
+
+## Burned into an Antithesis triage shot (submitted run)
+
+- **`rss-bounded-under-cardinality` (behavioral)** and **`forwarder-eventual-delivery` (baseline liveness)** — run id (redacted; tracked internally) (test-name `saluki-adp-bug-hunt`, 30 min, submitted 2026-05-29). The `parallel_driver_send_dogstatsd` high-cardinality regime drives memory growth; `finally_verify_delivery` checks delivery. Triage with the `antithesis-triage` skill once it completes.
+
+## Antithesis-shot-only — blocked on harness infrastructure (not locally reproducible)
+
+These discovered defects cannot be reproduced as local unit tests and require infrastructure not yet
+built. Each is a follow-up Antithesis shot, not a current repro.
+
+- **`memory-limiter-survives-rss-read-failure`** — `check_memory_usage` does `querier.resident_set_size().expect(...)`; an RSS-read failure panics the checker thread, silently disabling memory protection. **Not locally reproducible:** the `Querier` is constructed internally with no injection seam. **Shot blocker:** a custom `/proc` fault (enabled for the tenant) + a memory-limiter-enabled ADP config variant + a SUT-side `assert_unreachable!` at the `.expect` site.
+- **`config-runtime-update-not-revalidated`** — the incompatibility gate runs only at startup; a runtime config-stream update can introduce a high-severity-incompatible key with no re-gate. **Shot blocker:** config-stream add-on **and** human confirmation of intent (intentional trust of the authoritative Agent vs. oversight) — flagged `(needs human input)` in the catalog.
+- **`shutdown-drains-no-loss` / `graceful-shutdown-within-30s` (forceful-stop data loss)** — behavioral; needs the running harness shut down under a slow/blocked intake to exceed the 30s grace window. **Shot blocker:** an intake failure-mode toggle + a shutdown driver.
+
+## Recorded as covered / not a distinct defect
+
+- **`aggregate-clock-skew-stable` (backward-jump gap)**: dual of bug #3; same root cause (no monotonicity guard on `current_time` vs `last_flush`). Covered by #3's triage.
+- **aggregate context-limit counts contexts, not bytes**: a single context with many timestamped values has unbounded value memory; a facet of `rss-bounded-under-cardinality`, partially covered by #4 and the catalog's open question.
+- **`ddsketch-relative-error-bound` merge non-associativity** — f64 ordering sensitivity; a library-level numeric property, not a clear ADP runtime defect (the catalog demoted it to a harness/proptest invariant).
+- **`source-dispatch-no-misroute`** — misroute is structurally improbable with the current `extract`-then-`send_all` ordering; the live facet (silent uncounted loss on a downstream dispatch error) would need a failing-dispatcher harness. Candidate future local repro; not a confirmed defect today.
+
+## Status
+
+Cleanly-reproducible discovered bugs that remain open: **5 local repros + 1 submitted run.** One
+original repro (the aggregate sub-second `% 0` panic) was **fixed upstream on main** (PR #1772) and is
+recorded above as resolved-stale — its repro in the sibling stack commit needs removal/conversion. The
+remaining items are explicitly blocked on harness infrastructure (custom `/proc` fault, config-stream
+add-on, intake failure toggle) or human input, and are recorded above as follow-up Antithesis shots.
+No further bug is reproducible without building that infrastructure.
diff --git a/test/antithesis/scratchbook/deployment-topology.md b/test/antithesis/scratchbook/deployment-topology.md
new file mode 100644
index 00000000000..fcef2890c0d
--- /dev/null
+++ b/test/antithesis/scratchbook/deployment-topology.md
@@ -0,0 +1,217 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence — correctness-test topology (diff-testing Agent vs ADP via fakeintake) informs reuse.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP
+    why: Documents the Core Agent → AgentSecure config-stream → ADP relay topology (informs the config-stream add-on).
+---
+
+# Deployment Topology: Agent Data Plane (ADP)
+
+## Guiding principle
+
+The existing `bin/correctness` harness (`millstone` + `datadog-intake` + `panoramic` + `airlock`) is
+a near-perfect base — but it is built for **deterministic diff-testing with a healthy intake**.
+Antithesis's value is the opposite: **fault the links the harness keeps healthy**. The single most
+important design move is to put the **ADP → intake HTTP forwarding link across a container boundary**
+so Antithesis can partition, delay, drop, and black-hole it. Everything else stays minimal.
+
+Antithesis injects faults **per container**: anything in one container shares its fate, and links
+*within* a container cannot be faulted. So component placement is dictated by which links we need
+faultable.
+
+> **Post-evaluation routing (catalog now 35 properties).** The primary topology + listener variant
+> cover ~24 of 35. The remaining ~11 route to the two add-ons: **Add-on 1 (config stream)** covers
+> the config/reload cluster — `config-stall-no-deadlock`, `config-incompatible-refuses-start`,
+> `config-runtime-update-not-revalidated`, **`filter-config-reload-correct`**, and the reload facets
+> of `tag-filterlist-applied-consistently`; **Add-on 2 (diff-test)** covers the differential
+> correctness properties — `aggregate-matches-agent`, **`mapper-output-matches-agent`**,
+> **`prefix-filter-ordering-matches-agent`**, and the differential facet of
+> `tag-filterlist-applied-consistently`. The events/service-checks trio
+> (`events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, `events-sc-pipeline-reachable`) runs in
+> the **primary** topology (the workload must emit events + service-checks, not only metrics).
+> `mapper-interner-bounded` runs in the primary topology (high-cardinality mappable names + small
+> mapper interner).
+
+## Primary topology (covers ~24 of 35 properties)
+
+Standalone ADP, fed deterministic load, forwarding to a mock intake — all three on separate
+containers so every link is faultable.
+
+```text
++------------------------+        DogStatsD          +------------------------+      HTTP (Datadog      +------------------------+
+| workload-client        |  (UDP/TCP, faultable)     | adp                    |       intake API,       | mock-intake            |
+| - millstone load gen   | ------------------------> | agent-data-plane       |  faultable, retryable)  | datadog-intake         |
+| - Antithesis SDK       |                           | (standalone mode)      | ----------------------> | (mock fakeintake)      |
+| - test template        | <------------------------ | UDP/TCP/UDS listeners  | <---------------------- | records payloads,      |
++------------------------+   backpressure / health   +------------------------+      acks / 5xx / hang  | queryable for asserts  |
+                                                                                                          +------------------------+
+```
+
+| Container | Role | Image | Runs | Connections | Replicas |
+|---|---|---|---|---|---|
+| `adp` | Service (SUT) | reuse `docker/Dockerfile.agent-data-plane` (standalone build) | `agent-data-plane run` in **standalone mode** (`DD_DATA_PLANE_STANDALONE_MODE=true`, `DD_DATA_PLANE_DOGSTATSD_ENABLED=true`), no Core Agent dependency | receives DogStatsD from `workload-client`; forwards to `mock-intake` over HTTP | 1 |
+| `mock-intake` | Dependency | reuse `docker/Dockerfile.correctness-tools` (the `datadog-intake` binary) | mock Datadog intake; record + count forwarded payloads; expose a query API the workload reads for assertions | receives ADP forwarder traffic; queried by `workload-client` | 1 |
+| `workload-client` | Client (test driver) | new thin Dockerfile layering the `millstone` binary + test template + Antithesis Rust SDK | emits `setup_complete`, then test commands drive `millstone` load and run assertions against `mock-intake` | sends DogStatsD to `adp`; queries `mock-intake` | 1 |
+
+Notes:
+- **Use UDP or TCP, not UDS, between `workload-client` and `adp`.** UDS requires a shared volume
+  (same fate / no faulting), and it couples origin-detection credentials. UDP/TCP keeps the
+  DogStatsD ingest link *and* the ADP → intake link independently faultable and lets
+  `malformed-dsd-no-crash` exercise the network listeners. (UDS-specific listener behavior can be a
+  secondary case with a shared-volume sidecar; see "Listener-coverage variant".)
+- **Point ADP's forwarder at `mock-intake`** via `DD_URL` / forwarder endpoint config; set a
+  well-formed placeholder (fake) API key. This is the link that unlocks the entire egress data-loss cluster.
+- `millstone` already supports deterministic seeds and fixed payload counts (`millstone.yaml`),
+  so the workload is reproducible; Antithesis adds the fault dimension on top.
+
+### What the primary topology covers
+
+- **Memory & resource bounds (Cat A):** high-cardinality / many-timestamp `millstone` corpus +
+  `memory_mode`/`memory_limit` set on `adp`; node-throttling on `adp` to stress the limiter timing;
+  observe RSS vs grant. `rss-bounded-under-cardinality`, `aggregate-context-limit-enforced`,
+  `interner-full-bounded`, `memory-limiter-survives-rss-read-failure` (needs `/proc` fault — see
+  faults), `retry-queue-bounded-under-outage`.
+- **Data integrity & no silent loss (Cat B):** partition / delay / black-hole the `adp↔mock-intake`
+  link, then heal it. `no-silent-interconnect-drop`, `forwarder-eventual-delivery`,
+  `retry-queue-bounded-under-outage`, `shutdown-drains-no-loss`, `source-dispatch-no-misroute`.
+- **Aggregation crash/clock (Cat C subset):** config exploration
+  (`aggregate_window_duration_seconds`, now `NonZeroU64` — the `% 0` panic is closed upstream, so
+  `aggregate-no-panic-any-window` is just a regression tripwire) and clock jitter.
+  `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable`. Sketch internals
+  (`ddsketch-*`, `ddsketch-no-nan-poison`) ride the same workload with SUT-side assertions; the NaN
+  bypass needs a `checks_ipc` Histogram producer (see the last bullet of "Assumptions & open questions").
+- **Lifecycle (Cat D subset):** SIGINT / node-termination on `adp`. `graceful-shutdown-within-30s`,
+  `data-component-failure-triggers-process-shutdown`, `topology-ready-before-intake`.
+- **Untrusted input (Cat E) + concurrency (Cat F):** adversarial DogStatsD packets from the
+  workload (`malformed-dsd-no-crash`); interner races and non-finite handling ride normal load with
+  SUT-side assertions (`interner-reclamation-no-corruption`, `non-finite-values-handled-consistently`).
+- **Events & service-checks (Cat B/E additions):** the workload must emit well-formed *and*
+  malformed events + service-checks so `events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, and
+  the anti-vacuity anchor `events-sc-pipeline-reachable` are exercised — a metrics-only `millstone`
+  corpus leaves these vacuous.
+- **Transformer correctness (Cat G, primary-runnable subset):** `mapper-interner-bounded` rides a
+  high-cardinality flood of distinct *mappable* names against a small `dogstatsd_mapper_string_interner_size`.
+  The differential Cat G properties (`mapper-output-matches-agent`, `prefix-filter-ordering-matches-agent`)
+  need Add-on 2; the reload ones need Add-on 1.
+
+## Add-on 1 — Core Agent config stream (the `config-*` cluster)
+
+Standalone mode bypasses the remote-agent config stream, so the config-stream properties need a
+fourth container: a **Core Agent (or a minimal gRPC config-stream stub)** that ADP registers against
+and receives snapshot/partial config from.
+
+```text
++-------------------------+   gRPC config stream (faultable)   +-------------------------+
+| core-agent-stub         | <--------------------------------> | adp (remote-agent mode) |
+| - IPC/gRPC config srvr  |   register / snapshot / partial    | DD_DATA_PLANE_STANDALONE|
+| - Status/Flare/Telem    |                                    |  _MODE=false            |
++-------------------------+                                    +-------------------------+
+```
+
+| Container | Role | Image | Runs | Why |
+|---|---|---|---|---|
+| `core-agent-stub` | Dependency | reuse `docker/Dockerfile.datadog-agent`, **or** a new minimal stub speaking the remote-agent IPC/gRPC config protocol | serves registration + config snapshot/partial over gRPC | exercises the no-timeout startup wait and runtime config-apply path |
+
+Covers: `config-stall-no-deadlock` (delay/drop the config stream → quiescent indefinite hang at
+`ready().await`, *the* falsification target), `config-incompatible-refuses-start` (send a
+high-severity-incompatible non-default key at startup → expect exit 1),
+`config-runtime-update-not-revalidated` (send the incompatible key *after* startup → observe silent
+apply), and the **Category G runtime-reload cluster** — **`filter-config-reload-correct`** (push
+filter config over the stream while metrics flow; explore `broadcast::Lagged` staleness, partial
+apply, and key-deletion-clears-all-filtering) plus the reload facet of
+`tag-filterlist-applied-consistently` (stale cache after a Lagged-dropped reload). This is the design partner's
+documented "Tag Filter RC Relay Stress Test" focus, so the stub must be able to send adversarial/partial/laggy
+filter updates. A **stub is preferred over the full Datadog Agent** for state-space minimality and because
+we need to send adversarial/malformed config the real Agent would never emit; flag as a build task.
+
+## Add-on 2 — Diff-test for `aggregate-matches-agent` (heaviest; optional, separate run)
+
+The differential property needs the Datadog Agent baseline and ADP comparison fed identical load,
+each forwarding to its own intake, then compared. This is the existing `panoramic` correctness setup;
+under Antithesis the comparison runs as a `finally_`/`eventually_` check during a quiet period.
+
+```text
+                +------------------+      +------------------+
+   millstone -->| datadog-agent    | ---> | intake-baseline  |
+   (same seed,  | (baseline)       |      +------------------+
+    fan-out) -->| adp (comparison) | ---> | intake-comparison|  --> finally_: stele diff within ratio
+                +------------------+      +------------------+
+```
+
+This doubles the container count and the state space, so run it as its **own test template/topology**,
+not bundled with the fault-focused primary run. Keep faults light here (fault-induced flush timing
+differences create false diffs unless the comparison runs in an `ANTITHESIS_STOP_FAULTS` quiet window
+long enough to cover `FLUSH_WAIT`≈32s on both sides). Reuse `stele`/`panoramic` analysis logic.
+
+Beyond `aggregate-matches-agent`, this add-on is also where the **Category G differential**
+properties run, with identical config/profiles on both baseline and comparison:
+**`mapper-output-matches-agent`** (identical `dogstatsd_mapper_profiles`, corpus of mappable names),
+**`prefix-filter-ordering-matches-agent`** (corpus where keep/drop depends on stage order), and the
+differential facet of **`tag-filterlist-applied-consistently`** (post-filter name/tags within ratio).
+The corpus must actually exercise mapped/filtered metrics or these (and the `aggregate-matches-agent`
+roll-up's implicit coverage of them) pass vacuously.
+
+## Listener-coverage variant (secondary)
+
+To cover UDS-datagram / UDS-stream listeners and the DogStatsD **replay** properties
+(`replay-no-panic-on-malformed-capture`, `replay-corruption-not-silent-eof`): add a `replay-client`
+that shares a volume with `adp` for the UDS socket and runs the `agent-data-plane dogstatsd replay`
+CLI against **adversarially generated capture files** (the workload synthesizes corrupt/truncated/
+zstd-bomb captures using the SDK's RNG). Replay runs as a separate CLI process, so its panic/OOM is
+isolated from the data plane — install the SUT-side panic/reachability assertions in that CLI process.
+No cross-container faults are needed for replay; it is pure untrusted-input exploration.
+
+## Fault requirements (confirm enabled for the tenant)
+
+Antithesis disables some faults by default. **The user confirmed (2026-05-28) that node
+termination, clock jitter, and custom `/proc` faults are all enabled for this tenant**, so the
+properties that depend on them are realizable rather than vacuous. The custom `/proc` fault still
+needs a script.
+
+| Fault | Needed by | Status |
+|---|---|---|
+| **Node termination** (kill/restart) | `disk-persisted-retry-survives-restart`, `data-component-failure-triggers-process-shutdown`, crash-recovery facets of `forwarder-eventual-delivery` and `aggregate-matches-agent` | **Confirmed enabled** |
+| **Clock jitter** | `aggregate-clock-skew-stable` (and the clock facet of `aggregate-matches-agent`) | **Confirmed enabled** |
+| Network partition / bad-node / congestion | entire egress data-loss cluster (Cat B), `no-silent-interconnect-drop` backpressure | Usually on |
+| Node throttling / CPU modulation | `rss-bounded-under-cardinality` (limiter timing), `memory-limiter-survives-rss-read-failure`, interner-race timing | Usually on |
+| **Custom fault** (`/proc` RSS-read failure) | `memory-limiter-survives-rss-read-failure` — needs a custom fault/script to make RSS unreadable mid-run | **Confirmed enabled** (script still TBD) |
+
+- **Liveness properties** (`forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`,
+  `shutdown-drains-no-loss`, `config-stall-no-deadlock`) need a quiet window to verify recovery: use
+  `eventually_`/`finally_` commands or `ANTITHESIS_STOP_FAULTS` after healing the intake/partition.
+- The intake-down scenario is also approximable **without** network faults by toggling
+  `mock-intake` into reject/5xx/hang modes via a custom fault or an admin endpoint — useful where
+  network-fault availability is limited.
+
+## SDK selection
+
+- **Workload client:** Rust — Antithesis Rust SDK (assertions + RNG for adversarial input
+  generation: malformed packets, corrupt captures, config keys).
+- **SUT (`adp`):** Rust — add the Antithesis Rust SDK for the **net-new SUT-side assertions** that
+  catalog Categories C/E/F require (NaN-at-sketch-boundary, bin-count, interner-corruption,
+  source-misroute, limiter-RSS-failure, replay-panic, and the aggregate divide-by-zero regression
+  tripwire — now a closed vector, `NonZeroU64`). These internal
+  states are invisible to a workload-only checker (see `property-catalog.md` "Catalog-wide notes").
+  Build a dedicated ADP image with the SDK + Antithesis coverage instrumentation enabled.
+
+## Assumptions & open questions
+
+- **Standalone mode is acceptable for the primary topology.** `AGENTS.md` calls standalone "not for
+  production," but it is the same mode the correctness suite uses and it removes the Core Agent from
+  ~24 properties' state space. Config-stream properties use Add-on 1 with remote-agent mode. Confirm
+  no standalone-only code path masks a production behavior we care about.
+- **`datadog-intake` needs a controllable failure mode** (reject / 5xx / slow / hang, ideally
+  toggleable at runtime) to drive the egress cluster without relying solely on network faults.
+  Confirm whether the existing binary supports this or needs a small extension.
+- **A minimal Core Agent config-stub** must be built (or the full `datadog-agent` image adapted) to
+  send adversarial config the real Agent wouldn't — needed for Add-on 1.
+- Whether the workload can drive DogStatsD over **UDP/TCP at the volume `millstone` targets** without
+  loss confounding the assertions (UDP is lossy by nature; for no-loss assertions prefer TCP/UDS, and
+  scope UDP cases to no-crash rather than no-loss).
+- The `checks_ipc` Histogram NaN bypass (`ddsketch-no-nan-poison`) needs a **checks-IPC producer** in
+  the topology (a check emitting a NaN histogram), which the DogStatsD-only primary topology lacks —
+  add a minimal checks-IPC feeder or a unit-level SUT assertion for that one property.
diff --git a/test/antithesis/scratchbook/evaluation/antithesis-fit.md b/test/antithesis/scratchbook/evaluation/antithesis-fit.md
new file mode 100644
index 00000000000..ee216f5fa54
--- /dev/null
+++ b/test/antithesis/scratchbook/evaluation/antithesis-fit.md
@@ -0,0 +1,361 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space.
+---
+
+# Antithesis Fit Evaluation — ADP Property Catalog
+
+Lens: for each of the 27 properties, does *verifying* it require exploring a state space that
+deterministic tests cannot reach (timing / concurrency / partial-failure / combinatorial), and does
+the chosen assertion TYPE match that mode? Bias: surface problems, not admire the catalog.
+
+Scope key: `catalog-wide` = affects the whole catalog or a family; `property-specific` = one slug.
+
+---
+
+## Findings
+
+### F1. Two "guaranteed-crash" properties are deterministic config/clock checks, not search problems
+- **Property/slugs:** `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable` (and the
+  catalog-wide note that bills them as "cheap, high-value first targets").
+- **Concern:** Both crashes are *deterministic given an input you already know the shape of*. They are
+  not states Antithesis must *discover* through interleaving; they are reproducible with a one-line
+  unit test. Antithesis's distinctive value (exploring an intractable timing/concurrency space) is
+  near-zero here. The real value Antithesis adds is narrower than the catalog implies: (a) *config-space
+  exploration* reaching the `{"secs":0,"nanos":N}` Duration shape the registry advertises, and (b) for
+  the clock property, a *clock-fault* the deterministic harness cannot inject. Those are genuine but
+  modest; the crash itself is trivially provable without a fuzzer.
+- **`aggregate-no-panic-any-window` specifically:** _Update 2026-05-30 — this defect is now **fixed
+  upstream** (window typed `NonZeroU64`, PR #1772); the analysis below is the historical record of
+  the original bug. The property survives only as a regression tripwire._ Confirmed deterministic at
+  the time — `mod.rs:818`
+  `timestamp % bucket_width_secs` panics on `bucket_width_secs == 0`, and `mod.rs:630`
+  `step_by(bucket_width_secs as usize)` panics on 0; `bucket_width_secs = bucket_width.as_secs()`
+  (`mod.rs:553`) truncates any sub-1s window to 0. There is no validation (grep confirms only
+  definition/default/use sites). This is a single bad config value → guaranteed panic on the first
+  metric. A `#[test]` constructing `AggregationState::new(Duration::from_millis(500))` and inserting
+  one metric proves it in milliseconds with zero search budget. **It does not need Antithesis to find
+  it; it needs a config validator and a unit test.** Verified further: the aggregate transform reads
+  `window_duration` only at construction (`mod.rs:92-93`, `:213`) — no `ConfigChangeEvent`/subscribe
+  wiring — so this is a *startup-config-only* vector. The catalog's open question "can the gRPC stream
+  push this at runtime?" resolves to **no** for this transform (it is not hot-reloaded), which removes
+  the one angle that would have made it a live runtime-exploration target.
+- **Evidence:** `mod.rs:818,630,553,92-93,213`; registry `aggregate.rs:9`; grep showing no
+  `ConfigChangeEvent` subscription in the aggregate transform.
+- **Suggested action:** Keep both as **config-exploration / fault-injection targets** but explicitly
+  *demote the framing from "headline crash-loop finding" to "cheap config/clock tripwire."* The
+  primary fix and primary regression guard is a load-time config validator (`window >= 1s`) plus a
+  unit test — that belongs in the SUT, not in search budget. In Antithesis, an
+  `assert_unreachable` at the `% 0` site is fine to keep (it costs nothing once instrumented and
+  catches a future runtime-push regression), but do not bill these as where Antithesis "earns its
+  keep." The clock property retains more Antithesis value than the window property (clock-fault
+  injection + the *forward-jump flood bound*, which IS a search-worthy quantitative invariant —
+  see F2).
+
+### F2. `aggregate-clock-skew-stable` is partly Antithesis-worthy, partly a deterministic check — the assertion bundles both
+- **Property/slug:** `aggregate-clock-skew-stable` (property-specific).
+- **Concern:** The property folds two very different things under one slug. The *forward-jump
+  zero-value flood* (`Always(zero_value_buckets.len() <= ceil(flush_interval/window)+slack)`) is a
+  genuine quantitative invariant that benefits from clock-fault injection during a *live flush race* —
+  good Antithesis fit. But the *backward-jump silent gap* and the *pre-epoch `unwrap_or_default()==0`*
+  cases are deterministic: step the clock, observe an empty range — reproducible without search. The
+  `Always(current_time >= self.last_flush)` monotonicity assertion is essentially a unit-testable
+  invariant about a single function. Bundling them means the high-value flood-bound assertion shares a
+  slug with deterministic checks, muddying budget attribution.
+- **Evidence:** `mod.rs:627-635` (zero-value loop), `time.rs:21-26` (`unwrap_or_default`), property
+  file lines 37-55.
+- **Suggested action:** Keep the property, but in the workload prioritize the **forward-jump flood
+  bound** as the search-worthy assertion (it interacts with flush timing and memory). The
+  monotonicity/backward-gap facets are worth an `assert_always`/`Sometimes` but recognize they would
+  be caught by a deterministic clock-step test too; they ride along cheaply once clock-fault is wired.
+
+### F3. `ddsketch-relative-error-bound` is pure library/proptest territory — it is not an Antithesis property
+- **Property/slug:** `ddsketch-relative-error-bound` (property-specific).
+- **Concern:** The property file's own Investigation Log resolves that **ADP never calls
+  `DDSketch::quantile` on the live customer path** — histogram percentiles use raw-sample
+  `HistogramSummary::quantile` (exact, not a DDSketch bound) and distribution sketches ship raw bins to
+  Datadog, quantiled server-side. The only runtime `quantile` caller is the prometheus *internal-
+  telemetry* destination. So `Always(|q_est - v| <= eps_rel*|v|)` cannot be anchored to any production
+  runtime call. The accuracy half is a *mathematical invariant of a pure function* over arbitrary
+  inputs — the textbook definition of a proptest, not a state-space-exploration target. There is no
+  timing, concurrency, partial-failure, or live-state dimension to it.
+- The merge-associativity half ("merge order under interleaving") *sounds* Antithesis-flavored, but
+  the merges happen inside the single-task aggregate transform; the f64 reordering it worries about is
+  deterministic for a given merge order and is again a proptest over `merge(A,merge(B,C))` vs
+  `merge(merge(A,B),C)`. Antithesis would have to construct the same input sets a proptest constructs,
+  with worse shrinking.
+- **Evidence:** property file Investigation Log lines 89-128 (no live quantile call); `lib.rs:56`
+  (agent sketch re-export); `prometheus/mod.rs:346` (the lone non-production caller).
+- **Suggested action:** **Remove from the Antithesis catalog; reassign to proptest/unit territory** (a
+  Hegel/proptest test over the agent `DDSketch` accuracy + merge associativity). If anything from this
+  property survives into Antithesis, it is already covered by `ddsketch-bin-count-bounded` (bin
+  serialization fidelity) and `ddsketch-no-nan-poison` (sum/avg sanity). Demoting to Medium (as the
+  catalog did) understates the problem — its assertion type does not match the testing mode at all.
+  Keeping it as a live `Always` is misleading because there is no live call to assert against.
+
+### F4. `ddsketch-bin-count-bounded` substantially duplicates existing proptests; Antithesis value is a thin regression tripwire
+- **Property/slug:** `ddsketch-bin-count-bounded` (property-specific).
+- **Concern:** Verified that strong proptest coverage already exists:
+  `prop_bin_count_never_exceeds_limit` at `sketch.rs:925`, plus `prop_output_bins_are_sorted_and_distinct`,
+  `prop_output_keys_are_highest_from_input`, and unit tests including
+  `trim_left_respects_bin_limit_with_large_weights`. The invariant is *structurally enforced* —
+  `trim_left` runs after every mutating method. Antithesis cannot explore an input space the proptest
+  doesn't already cover for the *math*; the proptest already generates `i16` keys × `u32::MAX` weights. The
+  genuinely non-duplicative value is narrow and real: a **live regression tripwire** that fires if a
+  *future mutator* (a new insert helper, a new merge path, the future "merge agent-shipped sketches"
+  use case) forgets to call `trim_left` — something a proptest on the *existing* methods cannot catch
+  because it tests the methods that exist today.
+- **Evidence:** `sketch.rs:919-1023` (proptests), `sketch.rs:255,319,358,579` (trim_left at every
+  mutation site), property file lines 38-40, 60-63.
+- **Suggested action:** **Keep, but reframe and down-weight.** It is NOT a state-space-exploration win;
+  it is a cheap *always-on guard against forgetting `trim_left`*. The assertion (`Always(bins.len() <=
+  4096)` at the end of every mutator) is correct and basically free once the SDK is in. But the catalog
+  lists it **High** priority — that is too high for a property that mostly re-states an existing
+  proptest. Recommend **Medium**: the marginal value over proptests is the live-mutator-regression
+  tripwire only, and the `Reachable("trim_left collapsed bins")` anchor is essential: without it the
+  `Always` is vacuous on real production sketches (which rarely exceed 4096 bins under normal cardinality).
+
+### F5. `config-incompatible-refuses-start` is a deterministic gate check that the existing integration suite already exercises
+- **Property/slug:** `config-incompatible-refuses-start` (property-specific).
+- **Concern:** The gate is a single, *deterministic, ordered control-flow check*: `check_and_warn_config`
+  at `run.rs:157` returns `Err` → `exit(1)` *before* `create_topology`/`build`/`spawn`. There is no
+  timing, concurrency, or partial-failure dimension — given the config, the outcome is fixed. The
+  property file itself notes the existing **integration suite already has "config-check exit codes"
+  cases** (sut-analysis §6). The proposed `assert_unreachable("pipeline spawned with high-severity key")`
+  after `spawn()` is *statically unreachable* (the `?` already returned) — it is a belt-and-suspenders
+  guard against a future reordering regression, not a search target. Antithesis adds essentially nothing
+  over a parameterized integration test that feeds N high-severity keys and asserts exit 1.
+- **Evidence:** `run.rs:157,331-381`, `main.rs:136-146`; integration "config-check exit codes" cases
+  (sut-analysis §6); property file lines 20-49.
+- **Suggested action:** Keep as a **cheap config-exploration target** (the SDK markers cost nothing once
+  added, and config-space exploration can mutate which key/value is injected), but **demote from High**.
+  The deterministic gate is already covered by integration tests; Antithesis's only marginal add is
+  fuzzing *which* key triggers it. Most of the catalog's listed value here is duplicative.
+
+### F6. `config-stall-no-deadlock`: the high-value falsification target was retracted; remaining content is a quiescent-hang check with a weak in-process assertion
+- **Property/slug:** `config-stall-no-deadlock` (property-specific).
+- **Concern:** The property's own Investigation Log **retracts the busy-loop hazard** ("downgrade it
+  from highest-value falsification target to a non-issue") because tonic terminates the stream after one
+  error → 5s backoff. What remains is: drop the snapshot → ADP hangs *quiescently forever* at
+  `ready().await` (no timeout, `lib.rs:694-704`). That is a real and interesting liveness finding, and
+  detecting "indefinite quiescent hang vs progress" is reasonable Antithesis fit (timing of the config
+  stream is the explored dimension). BUT the assertion is weak: there is no clean in-process assertion
+  for "stalled forever" — the file admits the busy-loop guard is "best caught workload-side (monitor
+  CPU/log rate)" and that `Always(no panic)` is "implicit." So the property reduces to two reachability
+  markers (`Sometimes(config received)`, `Reachable(wait entered)`) plus an out-of-band CPU monitor.
+  The `Sometimes(config received)` is trivially satisfiable in the happy path and proves little; the
+  load-bearing "hang is quiescent" check is not an SDK assertion at all.
+- **Evidence:** property file lines 103-118 and Investigation Log 120-187; `lib.rs:694-704` (no timeout).
+- **Suggested action:** Keep — the no-timeout hang is a genuine operational finding worth demonstrating
+  — but **be honest that the verifiable artifact is a workload-side CPU/log-rate liveness check, not an
+  SDK invariant.** Consider whether the real recommendation is a SUT change (add a diagnostic timeout)
+  rather than a test. As written, Antithesis "proves" the hang is quiescent, which is a weak property
+  (it cannot prove "never makes progress" — only observe it didn't within the run).
+
+### F7. `data-component-failure-triggers-process-shutdown` — the `Always` is a temporal property the SDK cannot express in-process; the `Reachable` is the only clean assertion
+- **Property/slug:** `data-component-failure-triggers-process-shutdown` (property-specific).
+- **Concern:** The defensible invariant ("component death is *always* followed by process exit") is a
+  **temporal** property across the process lifetime. The property file itself concedes "To express the
+  Always invariant in-process is awkward (it is enforced by control flow)" and falls back to "a
+  workload-side temporal assertion" via Antithesis query-logs. So the in-process artifact is just
+  `assert_reachable` on the shutdown arm (`run.rs:280-283`) — which proves the path *exists*, not that
+  it *always* fires. Inducing the death is genuinely Antithesis-flavored (panic injection, sub-second
+  window, clean early finish), so the property has real fit; the concern is that the catalog's `Always`
+  framing oversells what the SDK can check. The actual guarantee is structural (one `JoinSet`, one
+  `wait_for_unexpected_finish` arm) and would be better unit-tested at the topology level for the
+  control-flow part.
+- **Evidence:** property file lines 96-108; `running.rs:40-51`, `run.rs:280-283`.
+- **Suggested action:** Keep (fault-induced component death is good Antithesis fit), but split the
+  claim: the `Reachable(death→shutdown path)` is the legitimate SDK assertion; the `Always(death⇒exit)`
+  is a **query-logs temporal check**, not an `assert_always`. Make that explicit so the synthesizer
+  doesn't double-count it as an in-process invariant.
+
+### F8. Source-side panic/divide-by-zero properties depend on whether a data-component panic actually crashes the *process* — verify the fail-stop chain or the no-crash assertions are unfalsifiable
+- **Property/slugs:** `aggregate-no-panic-any-window`, `malformed-dsd-no-crash`,
+  `data-component-failure-triggers-process-shutdown`, `source-dispatch-no-misroute` (catalog-wide
+  interaction).
+- **Concern:** Several "no crash" / "crash is caught" properties hinge on a panic in a *data component
+  task* propagating to a process exit Antithesis can observe. The fail-stop model says a panicking
+  component → `JoinError` → `wait_for_unexpected_finish` → whole-process shutdown → s6 restart. But
+  Antithesis observes a *container that never exits* (s6 restarts ADP in-place, per sut-analysis §6 and
+  deployment-topology). If the container masks the exit, an `Always(process up)` workload assertion
+  (`malformed-dsd-no-crash`) is **trivially satisfied even when ADP is crash-looping** — the container
+  stays up. This is a catalog-wide soundness risk for every "no crash" property whose assertion is
+  workload-side "process up." The catalog half-acknowledges this (it routes crash detection through
+  SUT-side `assert_unreachable` at panic sites) but the deployment topology's "process up" framing for
+  `malformed-dsd-no-crash` (`Always(process up)`) is exactly the unfalsifiable shape under s6.
+- **Evidence:** sut-analysis §6 ("container s6 supervisor restarts ADP on exit, so the container never
+  actually exits"); `malformed-dsd-no-crash` invariant `Always(process up)`.
+- **Suggested action:** Catalog-wide: make every "no crash" assertion SUT-side (`assert_unreachable` at
+  the panic site) **and/or** assert against restart-count / uptime telemetry, NOT "container process
+  up." Flag for the workload author that s6 auto-restart can make `Always(process up)` vacuously pass.
+  This is the single most important cross-cutting fit hazard.
+
+### F9. `replay-corruption-not-silent-eof` is a data-fidelity property with no fault/timing dimension — input-mutation only, marginal over a fuzz/unit test
+- **Property/slug:** `replay-corruption-not-silent-eof` (property-specific).
+- **Concern:** The "bad thing" is *silent truncation reported as success* — a deterministic function of
+  the input bytes (`reader.rs:84-104`). Given a corrupt length prefix, `read_next` returns `Ok(None)`
+  every time; there is no timing/concurrency/partial-failure dimension. The property file even notes
+  the current tests *assert* this behavior intentionally and that distinguishing truncation from clean
+  EOF "may need a format change." So Antithesis here is just an input fuzzer over a pure parser, and the
+  assertion (`AlwaysOrUnreachable(faithful completion)`) presupposes a SUT change that doesn't exist.
+  This is unit/proptest territory (corrupt-prefix → expected outcome), not state-space exploration.
+- **Evidence:** property file lines 21-43, 89-92; `reader.rs:84-104`.
+- **Suggested action:** Keep at **Medium or lower**, but recognize it as a **fuzz/proptest target on the
+  reader**, riding the same adversarial-capture corpus as `replay-no-panic-on-malformed-capture`. Its
+  Antithesis-specific value is near-zero beyond bundling with the panic property; the real deliverable
+  is a maintainer decision (is silent truncation acceptable?) plus a format/telemetry change.
+
+### F10. Several "Sometimes" anchors risk vacuity or astronomically-unlikely satisfaction; audit reachability budget
+- **Property/slugs:** catalog-wide, sharpest on `interner-reclamation-no-corruption`,
+  `ddsketch-bin-count-bounded`, `forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`.
+- **Concern:** Many properties pair a safety `Always`/`Unreachable` with a `Sometimes` that must be hit
+  for the safety claim to be non-vacuous. Two distinct hazards:
+  (a) **Hard-to-reach `Sometimes` → vacuous `Always`.** `ddsketch-bin-count-bounded`'s
+  `Always(bins<=4096)` is vacuous unless `Reachable("trim_left collapsed bins")` actually fires — but
+  real production sketches under normal cardinality rarely exceed 4096 distinct keys, so the collapse
+  path may never trigger without a deliberately pathological corpus. If the workload doesn't force it,
+  the property passes while proving nothing.
+  (b) **`Sometimes` requiring a precise race.** `interner-reclamation-no-corruption` needs
+  `Sometimes(drop re-check found resurrected entry)` — the `is_active()` re-check at `map.rs:459`
+  returning true. That is a narrow decrement→lock window. It is *exactly* what Antithesis is good at
+  (so fit is good), but the workload must run a near-full interner with heap-fallback **off** and high
+  churn or the contended path is never pressured (heap-fallback default true defuses it — property file
+  config-deps lines 67-69). If the workload uses defaults, the `Sometimes` never fires and the safety
+  `Always` is vacuous.
+- **Evidence:** `ddsketch-bin-count-bounded` lines 75-76; `interner-reclamation-no-corruption` lines
+  72-80, config-deps 67-71.
+- **Suggested action:** Catalog-wide: for every safety property gated by a `Sometimes`, the workload
+  MUST include the configuration/corpus that forces the anchor (small interner, heap-fallback off,
+  high-cardinality corpus that exceeds 4096 keys). Recommend the synthesizer add a "vacuity guard"
+  column: each `Always`/`Unreachable` is only meaningful if its paired `Sometimes` is *demonstrated*
+  reached in the run report. Properties whose `Sometimes` is unreached should be reported as
+  inconclusive, not passing.
+
+### F11. `memory-limiter-survives-rss-read-failure` is good fit but its priority is underestimated and gated behind a custom fault
+- **Property/slug:** `memory-limiter-survives-rss-read-failure` (property-specific) — *underestimated*.
+- **Concern:** This is a textbook Antithesis property — a *partial-failure* (mid-run `/proc` read
+  failure) producing a *silent* fail-open (frozen backoff at 0) that no deterministic test reaches; the
+  damaging case is a *race* (reads fail *before* RSS crosses threshold). The bare `std::thread` death
+  doesn't trigger process shutdown, so it is invisible. Yet the catalog rates it **Medium** ("High if
+  RSS reads can realistically fail post-startup"). Given that the whole memory family is "fails by
+  design under defaults," the *one* runtime protection silently vanishing is arguably the highest-stakes
+  partial-failure in Category A. The Medium rating undersells it. The countervailing fact: it requires a
+  **custom `/proc` fault** (deployment-topology flags it as "Custom; must script") and the limiter must
+  be *explicitly enabled* (default Disabled), so reachability is conditional.
+- **Evidence:** `limiter.rs:99-122` (`.expect()` in the steady-state loop), property file lines 36-50;
+  deployment-topology fault table (custom `/proc` fault).
+- **Suggested action:** Raise to **High conditional on the custom `/proc` fault being scriptable on the
+  tenant** and on the limiter being enabled in that workload variant. If the custom fault cannot be
+  built, the property is *unreachable* and should be reported as such rather than silently passing
+  (ties to F8/F10 vacuity concerns). This is a case where Antithesis value is *underestimated* by the
+  catalog's Medium tag.
+
+### F12. `source-dispatch-no-misroute` — the misroute is "structurally improbable," making the `Unreachable` likely vacuous; the real live hazard (silent uncounted loss) is relegated to a sub-clause
+- **Property/slug:** `source-dispatch-no-misroute` (property-specific).
+- **Concern:** The property file's own analysis (lines 33-45) concludes that with the current
+  `extract`-then-`send_all` ordering, **misroute is structurally impossible** — `extract` removes
+  matching events by predicate and recomputes `seen_event_types` before any send. So the headline
+  `Unreachable(misroute)` is *expected to be vacuously unreachable today*; it only guards a future
+  refactor. The genuinely live, Antithesis-reachable hazard is the **sub-clause**: a `send_all` failure
+  drops the extracted events and there is likely **no counter** for it ("possibly fully silent — a
+  finding," lines 102-104). That silent-loss-under-downstream-error path IS a partial-failure worth
+  exploring, but it is buried as clause (B) under a headline that will read as a vacuous pass.
+- **Evidence:** property file lines 33-45, 84-87, 102-104; `mod.rs:1667-1716`.
+- **Suggested action:** Re-center the property on the **silent-uncounted-loss-on-dispatch-failure**
+  facet (which overlaps `no-silent-interconnect-drop`) and treat the misroute `Unreachable` as a
+  cheap future-regression guard explicitly expected to be unreached. As written, the headline reading
+  (misroute) is the vacuous one and the live reading (silent loss) is the footnote.
+
+---
+
+## Passes (properties whose Antithesis fit and assertion type are sound)
+
+- **`rss-bounded-under-cardinality`** — Burst-vs-250ms-sample race + cooperative-only backoff is a
+  timing space no deterministic test reaches; `Always(rss <= grant*tol)` with SUT-side
+  `Sometimes(backoff_applied && rss_still_climbing)` is the right shape. Strong fit. (Caveat: needs the
+  limiter enabled and a real RSS reading not masked by the container — relates to F8.)
+- **`retry-queue-bounded-under-outage`** — Sustained-outage saturation + shared circuit-breaker +
+  per-endpoint queues + disk eviction is genuine partial-failure/timing territory; `Always(bytes<=cap)`
+  + `Sometimes(items_dropped>0)` correctly pairs a true invariant with a non-vacuity anchor. Strong fit.
+- **`no-silent-interconnect-drop`** — Backpressure-vs-drop under a slow downstream is exactly an
+  interleaving/partial-failure exploration; `Always(discarded delta==0)` on a wired edge +
+  `Sometimes(backpressure engaged)` is well-matched. Strong fit.
+- **`forwarder-eventual-delivery`** — Liveness after a 5xx/timeout/reset storm then recovery; correctly
+  typed as `Sometimes(all-delivered-after-recovery)` (liveness → progress, not instantaneous Always).
+  Needs a quiet/heal window. Strong fit.
+- **`disk-persisted-retry-survives-restart`** — SIGKILL mid-outage + restart + reconcile + poison-file
+  injection is the canonical Antithesis crash-durability scenario; the at-most-once slack caveat and the
+  `Reachable(persistence-active)` non-vacuity guard are correctly identified. Strong fit.
+- **`shutdown-drains-no-loss`** / **`graceful-shutdown-within-30s`** — Shutdown under load with a slow
+  intake pushing past the 30s boundary is a timing race; correctly typed as `Sometimes(clean-drain)` +
+  `AlwaysOrUnreachable(timeout⇒forceful)`. Good fit. (Minor: the two slugs overlap; the catalog already
+  carves who-owns-what cleanly.)
+- **`malformed-dsd-no-crash`** — Adversarial packet input across 4 listener types exploring
+  codec/framing error paths is good fuzz+fault fit; the SUT-side `Unreachable` at codec panic sites is the
+  right artifact. Pass *with the F8 caveat* (don't anchor on container "process up").
+- **`replay-no-panic-on-malformed-capture`** — Untrusted-file fuzzing of an unfuzzed, zero-coverage
+  path with confirmed OOM/zstd-bomb vectors; isolated in the replay CLI process so a panic IS
+  observable (exit). Good fit; the SUT-side `Unreachable` at panic sites + `Reachable(parse executed)`
+  is correct.
+- **`interner-reclamation-no-corruption`** — Real-scheduler exploration beyond the bounded loom model is
+  precisely Antithesis's edge over loom; the overlap/sentinel `Unreachable` is the right artifact. Pass,
+  *conditional on the F10 workload config* (small interner, heap-fallback off); otherwise the
+  `Sometimes` never fires and the safety assertion is vacuous.
+- **`aggregate-context-limit-enforced`** — Cardinality-flood × flush-timing × zero-value keep-alive
+  interleaving to hit/clear the breach flag is timing-sensitive; `Always(len<=limit)` +
+  `Sometimes(breached)` + `Sometimes(events_dropped)` is well-formed. Pass.
+- **`interner-full-bounded`** — Fill-the-buffer timing + concurrent intern-vs-drop on the reclamation
+  path; the heap-on/heap-off branch split with matched `Sometimes` anchors is correct. Pass.
+- **`aggregate-matches-agent`** — Differential property under faults (delayed flush, restart mid-window,
+  backpressure) the deterministic `panoramic` harness cannot inject; correctly anchored on the existing
+  diff harness as a `finally_`/quiet-window check, not an in-process assertion. Good fit (heavy; its own
+  topology). Pass.
+- **`ddsketch-no-nan-poison`** — Confirmed LIVE bypass via checks_ipc Histogram → encoder `insert_n`;
+  the SUT-side `Always(sum/avg finite)` at the sketch boundary is correct. *Note:* the poisoning itself
+  is deterministic given a NaN input (so the "find it" value is modest), but routing a NaN through a
+  realistic checks_ipc producer and proving it reaches the encoder boundary across the live topology is
+  more than a unit test. Pass, leaning toward "needs the checks_ipc feeder or it's a unit assertion"
+  (deployment-topology already flags this).
+- **`non-finite-values-handled-consistently`** — Adversarial all-NaN/Inf packets across metric types;
+  the honest framing (ghost-metric expected Unreachable on DSD path, NaN-poison live via non-DSD) is
+  correct and the assertion types match. Pass.
+- **`topology-ready-before-intake`** — Stalling a downstream's readiness / failing a supervisor child is
+  a timing/partial-init exploration; the honest narrowing to "readiness-milestone ordering" (not
+  "no byte read pre-ready") keeps the `Always` falsifiable. Pass.
+- **`config-runtime-update-not-revalidated`** — Inject a high-severity key over the runtime stream after
+  a clean start; this is a control-plane→data-plane path the diff-test never touches, and the
+  Reachable/Unreachable framing matches the open design question. Reasonable fit (Medium is right).
+
+---
+
+## Uncertainties
+
+- **U1 (F8 severity):** I have not confirmed *how* the Antithesis harness observes ADP process exit
+  under the container s6 supervisor — whether snouty/Antithesis sees the inner process restart or only
+  the container. If Antithesis can see inner-process restarts (restart count), the "no crash" workload
+  assertions are salvageable as-is; if it only sees the container, every `Always(process up)` is
+  vacuous. This determines whether F8 is a catalog-wide blocker or a workload-wording nit. sut-analysis
+  §6 strongly implies the container masks exits ("never actually exits"), so I lean toward blocker, but
+  did not verify the harness's process-observation granularity.
+- **U2 (F3 scope):** I treated `ddsketch-relative-error-bound` as fully non-Antithesis based on the
+  property's own resolved finding that quantile is never called live. If a future ADP change starts
+  querying quantiles in-process (e.g. a new local-rollup feature), the property would re-acquire live
+  fit. Flagging that the "remove from catalog" recommendation is contingent on the current no-live-call
+  fact remaining true.
+- **U3 (F1 fix direction):** Whether `aggregate-no-panic-any-window` should be fixed by clamp vs reject
+  vs sub-second support changes whether the surviving Antithesis assertion is `Always(window>=1)` or
+  `Unreachable(% 0)`. Unresolved in the catalog ("needs human input"); my "demote to config-validator +
+  unit test" recommendation holds regardless of fix direction, but the exact SDK assertion depends on it.
+- **U4 (custom-fault availability):** F11's priority bump for `memory-limiter-survives-rss-read-failure`
+  is conditional on a scriptable `/proc` RSS-read fault. I could not confirm the tenant supports custom
+  faults of this kind; deployment-topology marks it "Custom; must script." If unavailable, the property
+  is unreachable, not Medium.
+- **U5 (Sometimes-reachability of bin collapse):** F10(a) assumes real production sketches rarely exceed
+  4096 distinct keys under normal cardinality. I did not measure a realistic millstone corpus's
+  per-sketch key count; if the high-cardinality corpus routinely blows past 4096, the `Reachable(collapse)`
+  anchor fires naturally and the vacuity concern for `ddsketch-bin-count-bounded` is reduced.
diff --git a/test/antithesis/scratchbook/evaluation/coverage-balance.md b/test/antithesis/scratchbook/evaluation/coverage-balance.md
new file mode 100644
index 00000000000..6dbc4e54281
--- /dev/null
+++ b/test/antithesis/scratchbook/evaluation/coverage-balance.md
@@ -0,0 +1,329 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space.
+---
+
+# Coverage Balance Evaluation — ADP Property Catalog (27 properties)
+
+Lens: evaluate the catalog as a **portfolio**. Walk `sut-analysis.md` section by section; for
+each risk area check whether a property covers it, whether low-risk areas are over-invested, and
+whether the assertion-type mix (safety / liveness / reachability) is appropriate. Cite slugs and
+sut-analysis sections. Evidence is grounded in the SUT tree at the pinned commit.
+
+## Method / what I checked against source
+
+- Read all four scratchbook artifacts and all 27 property slug files (present in
+  `test/antithesis/scratchbook/properties/`).
+- Re-derived the live topology from `bin/agent-data-plane/src/cli/run.rs` (`create_topology`,
+  `add_*_pipeline_to_blueprint`, lines 414-758) and `bin/agent-data-plane/src/config.rs:100-156`
+  (`*_pipeline_required`).
+- Spot-confirmed: events/service_checks are always-wired when DogStatsD is on
+  (`config.rs:135-145`); checks_ipc Histogram has no finiteness guard
+  (`sources/checks_ipc/mod.rs:195`); host_tags lives in `bin/agent-data-plane/src/components/host_tags/`;
+  OTLP/traces/logs/APM pipelines are all wired in `run.rs`.
+
+## Type distribution (portfolio shape)
+
+Counting primary type tags across the 27 properties (compound tags counted by their lead type):
+
+- **Safety-dominant:** ~19 properties are Safety or Safety+X.
+- **Liveness-dominant:** 6 (`forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`,
+  `shutdown-drains-no-loss`, `topology-ready-before-intake`, `config-stall-no-deadlock`,
+  `graceful-shutdown-within-30s`).
+- **Reachability** appears only as a *secondary* clause (paired with Safety/Liveness); there is no
+  standalone reachability property.
+
+Assessment: the safety/liveness split is reasonable and matches the headline guarantees (no-crash =
+safety, no-loss = both). The portfolio is appropriately weighted toward the two headline families.
+The **reachability** dimension is structurally thin — it exists only as `Sometimes`/`Reachable`
+anti-vacuity riders. That is acceptable for most properties, but see Finding 7: several
+*event/service-check/enrichment* paths have **no reachability anchor at all**, meaning a workload
+could pass the whole catalog without ever exercising them.
+
+---
+
+## FINDINGS
+
+### Finding 1 — Events & service-checks intake→enrich→encode→forward path is essentially uncovered, despite being an always-on production path
+
+- **Property/slugs:** catalog-wide (gap). Tangentially touched by `source-dispatch-no-misroute`
+  (Medium) and `shutdown-drains-no-loss`, but neither asserts events/SC delivery correctness.
+- **Concern:** The catalog is metrics-heavy. When DogStatsD is enabled (the production config),
+  `events_pipeline_required()` and `service_checks_pipeline_required()` are **both true**
+  (`config.rs:135-145`), so `dsd_in.events → events_enrich → dd_events_encode → dd_out` and
+  `dsd_in.service_checks → service_checks_enrich → dd_service_checks_encode → dd_out` are live,
+  always-wired edges (`run.rs:681-684`). These are separate codecs
+  (`lib/saluki-io/src/deser/codec/dogstatsd/event.rs` 394 LOC, `service_check.rs` 312 LOC, each with
+  only 7-8 tests) and separate encoders. No property asserts:
+  (a) events/SC parse robustness (malformed event/SC packets — `malformed-dsd-no-crash` is scoped to
+  the codec generally but its angle and open questions are metric-framing-centric),
+  (b) events/SC eventual delivery or no-silent-loss (the Cat B liveness/safety properties name only
+  metrics/forwarder transactions),
+  (c) the events/SC encoders' own panic/backpressure behavior.
+- **Scope:** Two always-on production sub-pipelines.
+- **Evidence brief:** `run.rs:681-684`, `config.rs:135-145`; codec test counts above; sut-analysis §2
+  (DSD pipeline diagram explicitly shows the events/service_checks branches) — the analysis names
+  them but no property family adopts them.
+- **Suggested action:** A targeted discovery pass on the event/service_check codecs + encoders:
+  malformed event/SC framing (`_e{...}` and `_sc|...` shapes), no-silent-loss on the events/SC edges
+  under backpressure, and an eventual-delivery facet. At minimum add an event/SC `Sometimes(parsed)` /
+  `Sometimes(delivered)` reachability anchor so a metrics-only workload doesn't pass vacuously.
+
+### Finding 2 — Entire trace/APM and logs pipelines have zero properties (component blind spot vs. topology)
+
+- **Property/slugs:** catalog-wide (gap).
+- **Concern:** `run.rs` wires a full **traces/APM pipeline** (`traces_enrich` with `ottl_filter`,
+  `ottl_transform`, `apm_onboarding`, `trace_obfuscation`, `trace_sampler`; `dd_apm_stats`,
+  `dd_stats_encode`, `dd_traces_encode`; `run.rs:551-591`) and a **logs pipeline**
+  (`dd_logs_encode`; `run.rs:506-521`). The catalog has **no property** touching traces, APM stats,
+  OTTL transforms, trace obfuscation/sampling, or logs encoding. The deployment-topology doc also
+  never mentions them. This is the largest component blind spot relative to the actual topology.
+- **Scope:** Multiple transforms + encoders + a forwarder fan-in to `dd_out`.
+- **Evidence brief:** `run.rs:441-457, 506-521, 551-591`; sut-analysis §2 mentions only the DSD/OTLP
+  shape and does not enumerate traces/APM/logs — so this is a gap that exists in *both* the analysis
+  and the catalog.
+- **Suggested action:** Decide explicitly whether traces/APM/logs are **in scope** for ADP's first
+  customers (they may be gated off by default — `traces_pipeline_required` only fires for OTLP-native;
+  logs only for checks/OTLP-native). If out of scope, document the exclusion in the
+  catalog-wide notes so it's a *deliberate* boundary, not a silent omission. If in scope, at least
+  one no-crash/no-silent-loss property is warranted for the APM-stats and obfuscation transforms
+  (string-heavy, SQL-parsing obfuscation in `trace_obfuscation/sql.rs` is a classic untrusted-input
+  hazard).
+
+### Finding 3 — OTLP pipeline (native + proxy/relay) is named but uncovered; only referenced as a "closed path" in NaN reasoning
+
+- **Property/slugs:** `ddsketch-no-nan-poison`, `non-finite-values-handled-consistently` (mention OTLP
+  only to argue it is *closed*); no property *asserts* OTLP behavior.
+- **Concern:** `add_otlp_pipeline_to_blueprint` (`run.rs:700-758`) has two distinct modes — **native**
+  (`otlp_in` source → metrics_enrich/dd_logs_encode/traces_enrich) and **proxy/relay**
+  (`otlp_relay_in` relay + `local_agent_otlp_out` forwarder to the Core Agent, with a separate
+  `otlp_traces_decode` decoder path). OTLP is an *untrusted-input gRPC/protobuf parse surface*
+  analogous to the replay reader, and it deliberately bypasses aggregation (`run.rs:751-753`). The
+  catalog treats OTLP only as a finiteness-closed branch; there is no malformed-OTLP-no-crash, no
+  OTLP-relay-forwarder-delivery, and no property on the relay→forwarder edge.
+- **Scope:** A second untrusted-input source + a second forwarder (to Core Agent, not Datadog intake).
+- **Evidence brief:** `run.rs:700-758`; catalog mentions "OTLP is closed" in
+  `ddsketch-no-nan-poison` resolved-question and `non-finite-values-handled-consistently` §Resolved.
+- **Suggested action:** Targeted pass: is OTLP enabled for the design partner? If yes, an
+  OTLP-equivalent of `malformed-dsd-no-crash` (malformed protobuf / oversized OTLP request) and a
+  relay-forwarder delivery property are warranted. The `local_agent_otlp_out` forwarder is a
+  *second* egress path the entire Cat B data-loss family ignores.
+
+### Finding 4 — DSD transform chain (mapper / prefix-filter / tag-filterlist / post-agg-filter) has no correctness property despite a documented ordering-bug history
+
+- **Property/slugs:** catalog-wide (gap). `aggregate-matches-agent` (Safety, High) would catch a
+  gross divergence end-to-end but is anchored on the `panoramic` diff harness on happy-path load and
+  is explicitly a *separate, optional* run (deployment-topology Add-on 2); it is not a targeted
+  transform-ordering check.
+- **Concern:** sut-analysis §8 calls out **"moved DSD prefix/filter in front of enrich (pipeline
+  ordering bug)"** as a notable correctness fix in churn history. The live order is
+  `dsd_enrich(mapper) → dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter`
+  (`run.rs:674-679`). Tag filtering happens both pre- and post-aggregation. None of the 27 properties
+  asserts transform-ordering invariants or mapper/filter correctness (e.g. a metric that should be
+  prefix-dropped is never aggregated; a mapped name is mapped before filtering). This is a
+  bug-history item (the lens explicitly flags it) that did not map to a property.
+- **Scope:** Four chained transforms on the hottest metrics path, with regression history.
+- **Evidence brief:** `run.rs:638-679`; sut-analysis §8 "moved DSD prefix/filter in front of enrich".
+- **Suggested action:** Either (a) explicitly fold transform-ordering correctness into
+  `aggregate-matches-agent`'s scope (the diff harness *would* catch a reordering regression if the
+  workload includes prefix-filtered / mapped / tag-filtered metrics — confirm the `millstone` corpus
+  does), or (b) add a focused ordering property. Currently it relies on an optional, happy-path,
+  separate-run diff test — disproportionately weak for a known regression hotspot.
+
+### Finding 5 — Origin detection / workload-tagger enrichment correctness is uncovered
+
+- **Property/slugs:** catalog-wide (gap). `source-dispatch-no-misroute` touches the source but not
+  enrichment/tagging.
+- **Concern:** sut-analysis §2 and §9 highlight origin detection via **UDS peer credentials**
+  (credential errors counted but *do not drop the packet*) and the workload-tagger/workloadmeta
+  enrichment. `host_enrichment` and `host_tags` transforms run on every metric/event/SC
+  (`run.rs:482-489, 648-655`); `origin.rs` resolves origin tags. No property asserts enrichment
+  *correctness* (right tags attached, no cross-contamination of origin between concurrent packets on
+  a shared socket, behavior when peer-cred lookup fails). Given multi-tenant origin detection is a
+  correctness-critical and concurrency-sensitive area (per-packet credential lookup under a shared
+  UDS listener), the zero-property coverage is a disproportionate gap.
+- **Scope:** Origin resolver + two enrichment transforms, all on the hot path.
+- **Evidence brief:** `sources/dogstatsd/origin.rs`, `transforms/host_enrichment/mod.rs`,
+  `bin/.../components/host_tags/`; sut-analysis §2 "Origin detection uses UDS peer credentials;
+  credential errors are counted but do not drop the packet", §9.
+- **Suggested action:** Discovery pass on origin/tagger enrichment: (a) does a peer-cred failure
+  ever attach *another* connection's origin tags (cross-tenant tag leak — a silent data-corruption
+  hazard)? (b) can `host_tags`' gRPC dependency (built via `from_configuration().await`, and only in
+  non-standalone mode, `run.rs:486-489`) hang or deadlock enrichment if the tagger stream stalls,
+  analogous to `config-stall-no-deadlock`? Note: the **primary topology uses UDP/TCP, not
+  UDS** (deployment-topology), so origin detection is *structurally unexercised* by the primary run —
+  a second blind spot the topology choice creates.
+
+### Finding 6 — dsd_stats statistics tap and dsd_debug_log destination have no property
+
+- **Property/slugs:** catalog-wide (gap). `no-silent-interconnect-drop`'s open question asks whether
+  `dsd_stats_out`/`dsd_debug_log_out` can have zero senders — i.e. the catalog *noticed* these
+  destinations but did not adopt them.
+- **Concern:** `dsd_stats_out` is wired off `dsd_in.metrics` unconditionally (`run.rs:686`);
+  `dsd_debug_log_out` conditionally (`run.rs:688-693`). These are extra fan-out consumers on the
+  busiest output (`dsd_in.metrics` fans to dsd_enrich, dsd_stats_out, and optionally
+  dsd_debug_log_out). Per sut-analysis §4, fan-out delivers *sequentially* and a slow consumer stalls
+  the others — so a slow/blocked `dsd_stats_out` or `dsd_debug_log_out` could backpressure the entire
+  metrics intake. No property tests this fan-out-stall hazard on these taps.
+- **Scope:** Two destinations on the primary metrics output's fan-out.
+- **Evidence brief:** `run.rs:672, 686, 688-693`; sut-analysis §4 "an output with N senders … one
+  slow consumer stalls delivery to the others"; `no-silent-interconnect-drop` open question.
+- **Suggested action:** Either fold the dsd_stats/debug-log fan-out into
+  `no-silent-interconnect-drop`'s scope (assert a blocked tap backpressures rather than drops, and
+  resolve its own open question about zero-sender cases), or add a fan-out-stall reachability anchor.
+  Low-cost since it rides the primary topology.
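
A minimal sketch of the fan-out-stall hazard described in this finding, assuming nothing about the real interconnect: `fan_out_once` is a hypothetical stand-in that delivers to consumers in order and stops at the first one that cannot accept the item. The real topology would block on a full channel rather than return; `try_send` is used here only so the stall is observable without threads.

```rust
use std::sync::mpsc::SyncSender;

/// Hypothetical sequential fan-out: deliver `item` to each consumer in order.
/// Returns `Err(index)` for the first consumer that cannot accept the item
/// (full or disconnected); consumers after that index never see the item.
/// A blocking implementation would wait at that index instead of returning.
pub fn fan_out_once(senders: &[SyncSender<u32>], item: u32) -> Result<(), usize> {
    for (idx, tx) in senders.iter().enumerate() {
        if tx.try_send(item).is_err() {
            return Err(idx);
        }
    }
    Ok(())
}
```

With a capacity-1 tap listed ahead of a capacity-8 main consumer (both built with `std::sync::mpsc::sync_channel`), the second item stalls at the tap and never reaches the main consumer, which is the shape a fan-out-stall property on `dsd_stats_out`/`dsd_debug_log_out` would need to provoke.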
+
+### Finding 7 — SO_REUSEPORT UDP autoscaling has no property
+
+- **Property/slugs:** catalog-wide (gap).
+- **Concern:** sut-analysis §2 calls out `SO_REUSEPORT` UDP autoscaling on Linux
+  (`sources/dogstatsd/mod.rs:667-686`, also in `lib/saluki-io/src/net/listener.rs` and
+  `net/unix/linux.rs`). Multiple worker sockets bound to the same port is a concurrency/sharding
+  surface: packet distribution across workers, per-worker buffer-clear-and-continue interacting with
+  shared codec state, and worker count scaling under load. No property addresses multi-listener
+  behavior; `malformed-dsd-no-crash` implicitly assumes a single listener.
+- **Scope:** UDP listener sharding (Linux production default for high-throughput DSD).
+- **Evidence brief:** `sources/dogstatsd/mod.rs` REUSEPORT refs; sut-analysis §2.
+- **Suggested action:** Confirm whether REUSEPORT autoscaling is on by default and at what worker
+  count; if multi-worker, a no-crash / no-loss property under concurrent multi-socket load is
+  warranted (also strengthens `malformed-dsd-no-crash` and `interner-reclamation-no-corruption`,
+  which assume the real scheduler but a single read loop).
+
+### Finding 8 — Internal supervisor / control-plane restartability is not a property (only the fail-stop data side is)
+
+- **Property/slugs:** `data-component-failure-triggers-process-shutdown` covers the *data* side;
+  no property covers the *supervised internal* side.
+- **Concern:** sut-analysis §2 stresses the **crucial split**: the internal supervisor (control
+  plane, internal telemetry, env/workload) *is* restartable (OneForOne/OneForAll bounded by
+  intensity/period), but the data topology is fail-stop. The catalog has a strong property for the
+  fail-stop side but **nothing** asserting the supervised side actually restarts correctly within
+  its intensity/period bound, or that exceeding intensity escalates (does a crash-looping internal
+  child eventually take down the process, or spin forever?). `graceful-shutdown-within-30s`'s open
+  question even notes "the internal supervisor shutdown has **no timeout**" — an unguarded path with
+  no property.
+- **Scope:** The entire restartable half of the supervision model.
+- **Evidence brief:** `bin/agent-data-plane/src/internal/mod.rs`, `internal/control_plane.rs`,
+  `runtime/supervisor.rs`; sut-analysis §2 "Erlang/OTP-style Supervisor … OneForOne/OneForAll …
+  bounded by intensity/period"; `graceful-shutdown-within-30s` open question (no internal-supervisor
+  timeout).
+- **Suggested action:** Add a property: induce an internal-supervisor child failure (telemetry /
+  workload / control-plane) and assert (a) it restarts within the intensity/period bound
+  (`Sometimes(child restarted)`), (b) exceeding intensity escalates to a bounded outcome (not an
+  infinite restart spin), and (c) the no-timeout internal shutdown does not let total shutdown exceed
+  the operational expectation. This is the complement to the fail-stop property and is currently the
+  most under-covered architectural half.
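
The intensity/period bound in sub-claims (a) and (b) can be sketched as a sliding-window restart budget. This is an illustration in the Erlang/OTP style the finding cites, not the code in `runtime/supervisor.rs`; `RestartBudget`, `RestartDecision`, and the exact window-edge rule are assumptions.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// What a (hypothetical) supervisor does with a failed child.
#[derive(Debug, PartialEq, Eq)]
pub enum RestartDecision {
    Restart,
    Escalate,
}

/// Sliding-window budget: at most `intensity` restarts within any `period`.
pub struct RestartBudget {
    intensity: usize,
    period: Duration,
    // Times (offsets from an arbitrary start) of restarts still inside the window.
    recent: VecDeque<Duration>,
}

impl RestartBudget {
    pub fn new(intensity: usize, period: Duration) -> Self {
        Self { intensity, period, recent: VecDeque::new() }
    }

    /// Record a child failure observed at `now` and decide what to do.
    pub fn on_failure(&mut self, now: Duration) -> RestartDecision {
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.saturating_sub(oldest) >= self.period {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.intensity {
            // Budget exhausted: a bounded outcome instead of another restart.
            RestartDecision::Escalate
        } else {
            self.recent.push_back(now);
            RestartDecision::Restart
        }
    }
}
```

Under this model, sub-claim (a) corresponds to reaching the `Restart` branch at least once and sub-claim (b) to a crash loop always reaching `Escalate` after a bounded number of failures.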
+
+### Finding 9 — Survival of retry-queue entries across a mid-run API-key rotation is not a property
+
+- **Property/slugs:** `forwarder-eventual-delivery`, `retry-queue-bounded-under-outage`,
+  `disk-persisted-retry-survives-restart` all touch the retry queue but none exercise API-key
+  rotation.
+- **Concern:** sut-analysis §2 and §8 both flag that retry-queue IDs were *stabilized to survive
+  API-key rotation* (now load-bearing). This is a churn-history correctness fix with no property.
+  A key rotation mid-outage could (regression) re-key queued transactions such that a persisted entry
+  no longer matches its endpoint, dropping or duplicating data on recovery — exactly the durability
+  surface `disk-persisted-retry-survives-restart` cares about, but rotation is never injected.
+- **Scope:** Retry-queue identity stability across credential change.
+- **Evidence brief:** sut-analysis §2 "Retry-queue IDs are built to survive API-key rotation", §8
+  "stabilize additional-endpoint retry-queue IDs (now load-bearing for API-key rotation)";
+  `common/datadog/io.rs`.
+- **Suggested action:** Add an API-key-rotation fault dimension to the existing forwarder/retry
+  properties (rotate the API key via config-stream update during an intake outage, then heal, and
+  assert no-loss/no-dup recovery). Cheap to fold into `disk-persisted-retry-survives-restart` or
+  `forwarder-eventual-delivery` as an additional fault rather than a new property.
+
+### Finding 10 — Two bug-history items mapped; two did not
+
+- **Property/slugs:** `aggregate-matches-agent`, catalog-wide.
+- **Concern:** The lens asks which sut-analysis §8 fixes map to properties. Of the four named
+  correctness fixes (*histogram compensated summation*, *unitless histogram counts*, *timestamped
+  count sampling*, *prefix/filter ordering*), only the last two are even *implicitly* reachable via
+  `aggregate-matches-agent`'s diff harness, and prefix/filter ordering is only weakly covered (Finding 4).
+  **Histogram compensated summation** and **unitless histogram counts** have no dedicated property;
+  they would only surface in the diff test if the `millstone` corpus happens to include the
+  triggering histogram shapes and the 1e-8 ratio catches the drift. Compensated-summation regressions
+  are precisely the kind of small-magnitude error a 1e-8 ratio might mask under reordered merges
+  (related to `ddsketch-relative-error-bound`, demoted to Medium/library-only).
+- **Scope:** Two histogram-accuracy regression classes.
+- **Evidence brief:** sut-analysis §8 "compensated summation for histograms; unitless histogram
+  counts".
+- **Suggested action:** Confirm the diff-test corpus exercises histogram metrics with values chosen
+  to expose summation error (large + small magnitude mixed), and that the ratio is tight enough.
+  Otherwise these regressions are unguarded. Could be a sub-clause of `aggregate-matches-agent` or a
+  histogram-specific accuracy property.
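
To make the "large + small magnitude mixed" corpus suggestion concrete, here is a sketch comparing naive summation with Neumaier's compensated variant. Which compensation scheme ADP actually uses is not stated in this document, so the variant is an assumption, and `within_ratio` is a hypothetical reading of the harness's 1e-8 ratio check.

```rust
/// Naive left-to-right summation.
pub fn naive_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    for &x in values {
        sum += x;
    }
    sum
}

/// Neumaier's compensated summation: carries the rounding error of each
/// addition in `comp` and adds it back once at the end.
pub fn neumaier_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64;
    for &x in values {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of `x` were lost in `t`.
            comp += (sum - t) + x;
        } else {
            // Low-order bits of `sum` were lost in `t`.
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}

/// Hypothetical relative-difference check: true when `a` and `b` differ by
/// at most `ratio` relative to the larger magnitude.
pub fn within_ratio(a: f64, b: f64, ratio: f64) -> bool {
    let scale = a.abs().max(b.abs());
    if scale == 0.0 {
        return true;
    }
    (a - b).abs() / scale <= ratio
}
```

For the input `[1e16, 1.0, -1e16]` the naive sum is `0.0` and the compensated sum is `1.0`, which a 1e-8 ratio flags. By contrast, `1e16` versus `1e16 + 2.0` passes the same ratio, which is the masking concern: absolute drift hidden behind a large total.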
+
+### Finding 11 — Possible over-investment: the DDSketch/non-finite cluster carries 4 properties, ~2 of them redundant or not live-path
+
+- **Property/slugs:** `ddsketch-bin-count-bounded` (High), `ddsketch-relative-error-bound`
+  (Medium, demoted), `ddsketch-no-nan-poison` (High), and `non-finite-values-handled-consistently`
+  (Medium) which overlaps the NaN facet.
+- **Concern:** Four properties cluster on DDSketch/non-finite correctness. The catalog itself demotes
+  `ddsketch-relative-error-bound` to "a library property, not a live ADP runtime invariant"
+  (its own §Resolved: ADP does not call `quantile` on the live path) and notes
+  `non-finite-values-handled-consistently` overlaps `ddsketch-no-nan-poison`. So ~2 of the 4 are
+  partially redundant / not live-path. This is mild over-investment relative to, say, the
+  zero-coverage events/SC/traces/OTLP areas (Findings 1-3).
+- **Scope:** Sketch correctness cluster.
+- **Evidence brief:** `ddsketch-relative-error-bound` §Resolved (quantile not on live path);
+  `non-finite-values-handled-consistently` Priority Medium with overlap note;
+  `ddsketch-no-nan-poison` is the one genuinely live, High-value member.
+- **Suggested action:** Keep `ddsketch-no-nan-poison` (confirmed-live, High) and
+  `ddsketch-bin-count-bounded` (live regression tripwire). Consider merging
+  `ddsketch-relative-error-bound` and the NaN facet of `non-finite-values-handled-consistently` into
+  harness-side library tests rather than Antithesis runtime properties, freeing portfolio attention
+  for the uncovered pipelines. Not a correctness error in the catalog — a *weighting* observation.
+
+---
+
+## PASSES (areas where coverage balance is appropriate)
+
+- **Memory & resource bounds (Cat A):** Well-proportioned to its risk. Five properties cover the
+  RSS bound, the context cap, interner spill, RSS-read-failure, and the retry-queue byte cap — each
+  mapping to a distinct sut-analysis §7 attack surface (§7.1-5, 7). The "fails-by-design under
+  defaults" framing is correct and the highest-risk area gets the most attention.
+- **Egress data-loss (Cat B forwarder cluster):** `forwarder-eventual-delivery`,
+  `retry-queue-bounded-under-outage`, `disk-persisted-retry-survives-restart` together cover the
+  circuit-breaker, byte cap, and crash-durability surfaces (sut-analysis §2 egress, §6 gaps 1-3).
+  Strong, non-redundant, correctly liveness-typed. (Gap: API-key rotation, Finding 9.)
+- **Guaranteed-crash config/clock hazards:** `aggregate-no-panic-any-window` and
+  `aggregate-clock-skew-stable` captured the two zero-fault-injection deterministic crashes
+  (sut-analysis §7.8, §7.9). _Update 2026-05-30 — §7.8 (sub-second window divide-by-zero) is now
+  **fixed upstream** (window typed `NonZeroU64`, PR #1772) and demoted to a regression tripwire; the
+  clock-skew forward-jump crash remains live, high value, cheap._
+- **Untrusted DSD + replay (Cat E):** `malformed-dsd-no-crash`, `replay-no-panic-on-malformed-capture`,
+  `replay-corruption-not-silent-eof` cover the codec and the newest/largest replay feature
+  (sut-analysis §6 gap 6, §8). Replay is correctly weighted as the top regression-prone area.
+- **Lifecycle/config (Cat D):** `config-stall-no-deadlock`, `config-incompatible-refuses-start`,
+  `config-runtime-update-not-revalidated`, plus the two shutdown properties and the fail-stop
+  property cover the §7.13 no-timeout wait, the startup gate, and the §2 fail-stop model coherently.
+- **Type mix:** Safety-heavy is correct for a "no crash / no corruption" SUT; liveness is present
+  exactly where progress (delivery, drain, startup) is the contract.
+
+---
+
+## UNCERTAINTIES (need human/team input or a targeted pass to resolve)
+
+- **Are traces/APM/logs/OTLP in scope for the first-customer delivery?** This single answer
+  decides whether Findings 2 and 3 are real gaps or deliberate exclusions. ADP targets Agent 7.80.0
+  with `data_plane.enabled` routing DogStatsD; the catalog and topology both implicitly assume
+  DogStatsD-only. If that assumption is correct, document the exclusion; if not, these are the
+  largest gaps in the portfolio. (Needs team input.)
+- **Does the `millstone` correctness corpus exercise events, service_checks, mapped/prefix-filtered
+  metrics, and adversarial histogram values?** Determines whether `aggregate-matches-agent`
+  implicitly covers Findings 1 (events/SC delivery), 4 (transform ordering), and 10 (histogram
+  summation), or whether those are truly unguarded. (Needs a corpus read — `bin/correctness/`,
+  `millstone.yaml`.)
+- **Is SO_REUSEPORT UDP autoscaling on by default, and at what worker count?** Determines whether
+  Finding 7 is a live multi-listener concurrency surface or a single-worker no-op. (Needs a config
+  default read.)
+- **Is origin detection reachable at all in the planned topology?** The primary topology uses UDP/TCP
+  (no UDS), so peer-cred origin detection (Finding 5) is structurally unexercised. Confirm whether
+  the listener-coverage UDS variant is actually planned to run, else origin enrichment correctness is
+  untested by construction. (Needs topology decision.)
+- **Can a high-severity-incompatible key actually arrive over the config stream?** Open in
+  `config-runtime-update-not-revalidated`; also gates whether the API-key-rotation-via-config-stream
+  fault (Finding 9) is reachable. (Needs Core Agent protocol knowledge / team input.)
diff --git a/test/antithesis/scratchbook/evaluation/implementability.md b/test/antithesis/scratchbook/evaluation/implementability.md
new file mode 100644
index 00000000000..7f97651ccd5
--- /dev/null
+++ b/test/antithesis/scratchbook/evaluation/implementability.md
@@ -0,0 +1,516 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space.
+---
+
+# Implementability Evaluation — ADP Antithesis Property Catalog
+
+Lens: **can each property actually be CHECKED as planned?** Three sub-questions per property:
+(1) observability — workload-visible vs internal-state, and is the SUT-side instrumentation point
+real and reachable; (2) topology — does the planned deployment support the failure scenario;
+(3) preconditions — can the workload reliably construct the needed state within an Antithesis
+timeline. Bias: surface what blocks a green check.
+
+Verified against code this session (load-bearing for the findings below):
+- `run.rs:91,94,486` — standalone mode is a config flag; `check_and_warn_config` (gate) runs on the
+  *resolved* config regardless of source (`run.rs:157`).
+- `run.rs:446` — `checks_ipc` pipeline is gated on `dp_config.checks().enabled()`, **independent of
+  standalone mode**. `checks_ipc/mod.rs:5-13,39-48,57-77` — the source is a **gRPC server**
+  (`ChecksServer`, TCP :5105) consuming `SendCheckPayloadRequest`; it needs an **external gRPC client
+  speaking `datadog_protos::checks`** to emit anything.
+- `forwarders/datadog/mod.rs:60-66` + `endpoints.rs:157-180` — `with_endpoint_override(dd_url, api_key)`
+  exists; the forwarder CAN be pointed at a mock intake (primary egress topology is implementable).
+- `limiter.rs:54` — the limiter checker is a bare `std::thread::Builder::new()` thread (confirms the
+  silent-death framing of `memory-limiter-survives-rss-read-failure`).
+- `transforms/aggregate/mod.rs:92-93,194` — `window_duration` is read once via serde `as_typed()` at
+  build; **no `ConfigChangeEvent` subscriber** in the aggregate transform (grep: none). So
+  `aggregate_window_duration` is **startup-only**, not runtime-reloadable.
+
+Categories below: **F#** = a finding (something that blocks/weakens the planned check). The summary
+at the end groups Findings / Passes / Uncertainties.
+
+---
+
+## Cross-cutting prerequisite (affects ~all SUT-side properties)
+
+**F0 — The Antithesis Rust SDK is a hard prerequisite for ~17 of 27 properties, and it is not yet a
+dependency.** `existing-assertions.md` confirms zero SDK usage and no `antithesis-sdk` in any
+`Cargo.toml`/`Cargo.lock`. Every property whose check is an in-process `assert_*` (all crash/panic,
+NaN-at-sketch, bin-count, interner-corruption, source-misroute, limiter-RSS-failure, context-limit,
+retry-byte-cap, clock-skew, both replay properties, the lifecycle ordering/shutdown reachability
+markers) is **blocked until a dedicated ADP image is built with the SDK linked in** and the
+assertions are physically added at the cited code sites. The catalog acknowledges this in prose but
+does not treat it as a gating work item with a per-site edit list. **Scope:** topology + build.
+**Suggested action:** make "fork ADP, add `antithesis-sdk` dep, land the named assertions, build a
+second instrumented image" an explicit milestone before any SUT-side property can pass; the
+workload-only properties (below) can run first against a stock image. _Update 2026-05-30 — the SDK
+dep and instrumented build are now in place (ADP `antithesis` cargo feature + bootstrap probe, see
+`existing-assertions.md`); the remaining milestone is landing the per-site property assertions._
+
+The properties that are **workload-only checkable** (no SUT fork strictly required, read telemetry /
+mock-intake / process exit): `aggregate-context-limit-enforced` (counter-anchored), parts of
+`no-silent-interconnect-drop` (counter), `forwarder-eventual-delivery` (intake reconciliation),
+`disk-persisted-retry-survives-restart` (intake reconciliation), `retry-queue-bounded-under-outage`
+(only the *reachability* Sometimes; the byte-cap Always is internal), `config-incompatible-refuses-start`
+(exit code + log), `config-stall-no-deadlock` (CPU/log-rate + progress log), `aggregate-matches-agent`
+(harness diff). Everything else needs the fork.
+
+---
+
+## Category A — Memory & Resource Bounds
+
+### rss-bounded-under-cardinality — PASS (with a topology caveat)
+- **Observability:** OK. RSS read workload-side (or SUT-side from the same `Querier`). Expected-to-fail
+  is the finding, not a blocker.
+- **F1 — the RSS bound is unobservable unless a limit is actually configured, and the cgroup auto-path
+  is a silent trap.** The whole property presupposes `effective_limit_bytes()` exists. Under defaults
+  the limiter is a noop (`accounting.rs:37-40,174-178`) and there is no grant to compare against — so
+  the assertion has no threshold. The workload MUST set `memory_mode`+`memory_limit`. Worse
+  (`rss-bounded-under-cardinality.md:118-139`): a **non-empty `DOCKER_DD_AGENT` env var** silently
+  switches on cgroup auto-detect and changes the baseline. **Action:** pin `memory_mode`/`memory_limit`
+  explicitly in the `adp` container env and assert (or log-scrape) that a non-noop limiter is active,
+  else the run is vacuous; audit the base image for `DOCKER_DD_AGENT`.
+- **Precondition / timing:** the "many distinct timestamped values" inflation path needs sustained
+  high-cardinality load for long enough that interner heap-spill + `SmallVec` growth diverge from the
+  grant. Feasible with `millstone`, but the **container cgroup may OOM-kill ADP before the assertion
+  reads over-grant RSS** (the property's own open question). On a kill, the observable is a process
+  crash, not an RSS reading — still a guarantee violation, but it lands on a *different* property
+  (`data-component-failure...`/no-crash) and can confuse triage. **Action:** give the container
+  headroom above ADP's configured grant so the assertion fires before the kernel kill.
+
+### aggregate-context-limit-enforced — PASS
+- The `Always(len <= context_limit)` is a single-task, lock-free, local invariant — the cleanest
+  SUT-side `Always` in the catalog. `Sometimes(breached)` is anchorable to the existing
+  `events_dropped`/breach counter (workload-readable). **Precondition:** lower `aggregate_context_limit`
+  (e.g. 1000) so the boundary is reachable in-run — straightforward with a cardinality flood. No
+  topology or timing blocker.
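
A rough sketch of the invariant shape, assuming a simplified context map: `BoundedContexts` and its fields are hypothetical, and the plain `assert!` stands in for the SDK's `Always` assertion at the insert site. The `dropped` counter plays the role of the breach counter that anchors the `Sometimes(breached)` check.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the aggregate transform's context table.
pub struct BoundedContexts {
    limit: usize,
    contexts: HashMap<String, f64>,
    dropped: u64,
}

impl BoundedContexts {
    pub fn new(limit: usize) -> Self {
        Self { limit, contexts: HashMap::new(), dropped: 0 }
    }

    /// Record a value; returns false when a new context is rejected at the limit.
    pub fn record(&mut self, key: &str, value: f64) -> bool {
        if let Some(existing) = self.contexts.get_mut(key) {
            // Existing contexts are always updated, even at the limit.
            *existing += value;
            return true;
        }
        if self.contexts.len() >= self.limit {
            // Breach path: count the drop so a workload can observe it.
            self.dropped += 1;
            return false;
        }
        self.contexts.insert(key.to_owned(), value);
        // Stand-in for Always(len <= context_limit).
        assert!(self.contexts.len() <= self.limit);
        true
    }

    pub fn len(&self) -> usize {
        self.contexts.len()
    }

    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

With a limit of 2, a third distinct key is rejected and counted while updates to the first two keys still succeed, so a cardinality flood reaches both the bound and the breach counter in the same run.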
+
+### interner-full-bounded — PASS (mode B), PASS-with-precondition (mode A)
+- **Observability:** distinguishing interned/inlined/heap-fallback/dropped needs SUT-side state, as the
+  file says; `intern_fallback_total` exists as a `Sometimes` anchor (mode B, default) and is
+  workload-readable. Mode B is easy.
+- **F2 — mode A (bounded) preconditions are fiddly and fragility-prone.** To fill a fixed interner you
+  must (a) set a *small* `dogstatsd_string_interner_size_bytes`, (b) set
+  `dogstatsd_allow_context_heap_allocs=false` (opt-in, test-only in shipping code per
+  `interner-full-bounded.md:117-124`), and (c) use **names/tags > 31 bytes** so `MetaString` inlining
+  doesn't bypass the interner entirely (`:89-91`). Miss any one and the property is vacuous (never
+  fills, or spills to heap). Plus fragmentation under churn can make `try_intern` return `None` below
+  nominal capacity (open question), which muddies the "bounded == drops at budget" reading.
+  **Action:** bake (a)/(b)/(c) into the mode-A workload corpus and add a `Sometimes(try_intern==None)`
+  guard so a non-filling run is flagged rather than passing green.
+
+### memory-limiter-survives-rss-read-failure — CONDITIONAL (needs a custom fault + SUT fork)
+- **Topology:** requires a **custom `/proc` RSS-read fault** (deployment-topology.md:144 flags it as
+  "Custom; must script"). Antithesis cannot inject this out of the box; someone must write a fault
+  hook that makes `process_memory::Querier::resident_set_size()` return `None` mid-run on the target.
+- **F3 — the failure may be unreachable on the Antithesis Linux target without that custom fault, and
+  the property's whole value hinges on it.** The file's own pivotal open question
+  (`:93-98`): can `resident_set_size()` actually fail post-startup on this platform, or only via
+  injected corruption? If `/proc/self/statm` essentially never fails on the target, the realistic
+  panic is reachable *only* through the scripted fault — making this a fault-injection curiosity, not
+  a production risk. Also requires the limiter to be ON (`memory_mode!=disabled`+limit), i.e. the same
+  config prerequisite as F1.
+- **Observability:** the `.expect()` thread death is **silent** (bare `std::thread`, no shutdown, no
+  metric) — confirmed `limiter.rs:54`. So this is **not workload-observable** at all without the SUT
+  fork: you must replace the `.expect()` site with an `assert_unreachable` (or panic-hook). **Action:**
+  confirm tenant supports the custom `/proc` fault AND that the SUT fork lands the assertion; if the
+  fault is unavailable, demote/park this property — it cannot be checked.
+
+### retry-queue-bounded-under-outage — PASS (split observability)
+- **Topology:** needs `adp↔mock-intake` across a container boundary so the outage is faultable —
+  primary topology provides it. Mock intake needs a controllable reject/5xx/hang mode (topology open
+  question; `datadog-intake` may need a small extension). **Action:** confirm/extend the mock intake's
+  failure-mode toggle (also needed by `forwarder-eventual-delivery`, `shutdown-drains-no-loss`).
+- **Observability:** the byte-cap `Always` is internal to `RetryQueue::push` → SUT fork. The
+  `Sometimes(items_dropped>0)` is telemetry → workload-readable. Disk-cap branch requires disk
+  persistence ON and the silent-fallback (`io.rs:405`) to NOT have fired — the file already flags this
+  (must `assert_unreachable`/log-scrape the fallback or the disk-cap test is vacuous). All tractable.
+- **Precondition:** saturate the queue within a timeline — feasible with sustained load + a held-down
+  intake; size load vs. the 15 MiB default cap. Multi-endpoint fan-out multiplies the bound (decide
+  per-endpoint vs aggregate assertion — affects what threshold you check).
+
+---
+
+## Category B — Data Integrity & No Silent Loss
+
+### no-silent-interconnect-drop — PASS (scope carefully)
+- `events_discarded_total` is emitted and workload-readable; `Always(delta==0)` on wired edges is
+  checkable without a fork (counter scrape), and the discard branch only fires for zero-sender
+  outputs. **Caveat (file's open question):** confirm no production DSD output is ever
+  conditionally-unwired (`dsd_debug_log_out`/`dsd_stats_out`); if one is, scope the `Always` to the
+  always-wired edges or it false-positives. **Precondition:** must actually reach the full-channel
+  state — 128-buffer edges (`built.rs:653`) mean you need a genuinely slow downstream + sustained load;
+  the `Sometimes(backpressure engaged)` anchor needs a stall signal (rising send latency). Tractable
+  via throttling the intake.
+- **Transport note:** this is an internal-edge property; UDP lossiness upstream does NOT confound it
+  (the assertion is on the encoder→forwarder internal edges, not the wire). Fine on UDP or TCP.
+
+### forwarder-eventual-delivery — PASS (TCP/UDS for the input side)
+- **Observability:** primary check is workload-side reconciliation of accepted-retryable vs
+  mock-intake-received → no fork needed for the core; the `Reachable(Error::Open re-enqueue)` anchor
+  needs the fork.
+- **F4 — UDP input confounds the "all accepted ... delivered" reconciliation; this property MUST use
+  TCP or UDS for the DSD ingress.** The liveness claim equates *accepted* input to *delivered* output.
+  If `millstone→adp` is UDP (the topology's default suggestion), packets can be dropped on the wire
+  *before* acceptance, so "accepted" is unmeasurable from the sender side and the reconciliation set is
+  ill-defined. deployment-topology.md:175 hints at this ("for no-loss assertions prefer TCP/UDS"), but
+  the property files don't state it as a hard requirement. **Action:** pin DSD ingress to **TCP** for
+  this property (UDS needs a shared volume → not faultable, and the egress link is what we fault here,
+  so TCP ingress is fine). Same applies to `disk-persisted-retry-survives-restart` and
+  `shutdown-drains-no-loss`.
+- **Precondition:** the outage must end **before the retry queue overflows** (otherwise drop-oldest
+  legitimately sheds data and the reconciliation must exclude it). Needs a quiet/heal window for the eventual check
+  (`eventually_`/`ANTITHESIS_STOP_FAULTS`). Both standard.
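
The workload-side reconciliation could look roughly like the sketch below, assuming each accepted payload carries a workload-assigned `u64` identity that the mock intake echoes back. The identity scheme, `Reconciliation`, and `reconcile` are all hypothetical.

```rust
use std::collections::HashMap;

/// Result of comparing what the workload sent with what the intake saw.
#[derive(Debug, PartialEq, Eq)]
pub struct Reconciliation {
    /// Accepted by the SUT but never seen at the intake.
    pub missing: Vec<u64>,
    /// Seen at the intake more than once.
    pub duplicated: Vec<u64>,
}

/// Compare accepted identities with identities received at the mock intake.
pub fn reconcile(accepted: &[u64], received: &[u64]) -> Reconciliation {
    // Count how many times each identity arrived.
    let mut seen: HashMap<u64, u32> = HashMap::new();
    for &id in received {
        *seen.entry(id).or_insert(0) += 1;
    }

    let mut missing: Vec<u64> = accepted
        .iter()
        .copied()
        .filter(|id| !seen.contains_key(id))
        .collect();
    let mut duplicated: Vec<u64> = seen
        .iter()
        .filter(|(_, &count)| count > 1)
        .map(|(&id, _)| id)
        .collect();

    // Sort so reports are stable across runs.
    missing.sort_unstable();
    duplicated.sort_unstable();
    Reconciliation { missing, duplicated }
}
```

The eventual check would then assert that `missing` is empty once faults are stopped; whether `duplicated` must also be empty depends on the delivery contract each property states.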
+
+### disk-persisted-retry-survives-restart — CONDITIONAL (node-termination fault) — PASS otherwise
+- **Topology:** needs the **node-termination/kill fault** (topology table: "Commonly disabled — must
+  enable") + s6 restart. If the tenant has kill disabled, this property can't run.
+- **F5 — the silent in-memory fallback makes this vacuous unless explicitly guarded, and the
+  persistence-active signal is log-only (no metric).** `io.rs:405-408` falls back to in-memory with
+  only an `error!` log. The property file already prescribes treating the fallback as
+  `assert_unreachable` (fork) or log-scraping. Without that, a misconfigured disk path silently turns
+  this into an in-memory test that "passes" while proving nothing. **Action:** enforce the
+  persistence-active check as setup-gating.
+- **Observability/precondition:** reconciliation at the mock intake with transaction-identity dedup
+  (workload-side, OK); tolerate the ~1-txn at-most-once window (delete-before-return). Corrupt-entry
+  poison drop is checkable either by injecting a hand-crafted file or naturally via SIGKILL-mid-write
+  (non-atomic write, confirmed). Use TCP/UDS ingress (see F4).
+
+### source-dispatch-no-misroute — PASS (fork required; misroute structurally improbable)
+- **Observability:** the routing decision is internal — NOT telemetry-visible — so the
+  `Unreachable(misroute)` must be an in-process `assert_unreachable` checking the metrics dispatch
+  buffer is metrics-only (`mod.rs:~1707`). Fork required.
+- **F6 — the failure may be structurally unreachable, risking a vacuous/never-firing assertion.** The
+  file's own analysis (`:33-52`) shows `extract`-then-`send_all` removes matched events from the buffer
+  *before* the send can fail, so a send error causes *loss*, not misroute — the assertion likely never
+  fires. That's fine as a **regression tripwire**, but it means this property cannot demonstrate value
+  in a run (no `Sometimes` can prove the bad state is reachable, because it isn't). **Action:** keep it
+  as a guard, but set expectations that it is a latent-regression assertion, not a falsifiable-now
+  property; pair it with the *loss-counting* sub-claim (B) which IS observable. Also: the two
+  `.expect("...output should always exist")` are real crash sites if a deployment omits an output —
+  worth a separate guard but only reachable by mis-wiring (not normal load).
+
+### shutdown-drains-no-loss — PASS (conditional, intricate preconditions)
+- **Topology:** SIGINT/termination on `adp`; slow/blocked intake to push past 30s (needs the mock-intake
+  hang mode noted under `retry-queue-bounded-under-outage`). OK in primary topology.
+- **F7 — the "accepted-before-signal that reached a flushed window" set is hard to construct precisely,
+  making the no-loss reconciliation fragile.** Two designed-loss boundaries (open aggregation window
+  dropped unless `flush_open_windows`; 30s forceful stop) mean the reconciliation set must *exclude*
+  open-window and post-timeout data. Determining exactly which input metrics "reached a flushed window"
+  at the instant of the signal requires knowing the aggregate flush cadence vs. the signal time — a
+  timing-coupled boundary the workload can only approximate. **Action:** set `flush_open_windows=true`
+  to remove one boundary, drive only *closed-window* (pre-signal, aged > window) data into the no-loss
+  set, and assert the clean case as `Sometimes` (not `Always`). Use TCP/UDS ingress (F4). Realistic
+  drain time near 30s under max load is itself a finding (size load conservatively).
+
+---
+
+## Category C — Aggregation & Sketch Correctness
+
+### aggregate-matches-agent — CONDITIONAL (separate heavy topology) — implementable but expensive
+- **Topology:** Add-on 2 (datadog-agent baseline + adp + two intakes + panoramic/stele). Doubles
+  containers and state space; must run as its own template.
+- **F8 — fault injection and the differential check are in fundamental tension; the diff is only valid
+  in a fault-free quiet window, which limits what this property actually tests.** Injected scheduler
+  pauses/clock steps create *timing-artifact* diffs (delayed flush → different bucket) that are false
+  positives, not correctness bugs. The topology doc concedes the comparison must run inside an
+  `ANTITHESIS_STOP_FAULTS` window long enough to cover `FLUSH_WAIT≈32s` on both sides. So the property
+  largely re-runs the existing deterministic diff test under Antithesis with faults *paused* — the
+  net-new coverage (equivalence *under* faults) is the hardest part and is exactly where false diffs
+  bite. **Unresolved implementability questions:** can `panoramic` survive an ADP restart mid-run
+  (it may assume a single long-lived process)? Is the Agent baseline's bucket width pinned identical to
+  ADP's window (else the stele `interval_a==interval_b` check, `metrics.rs:171`, produces false positives)?
+  **Action:** treat as a low-fault, quiet-window equivalence run; verify `panoramic` restart-tolerance
+  before committing; this is the highest-effort, lowest-certainty property to operationalize.
+
+### aggregate-no-panic-any-window — PASS (cheapest high-value target)
+- Deterministic crash from config alone; no fault injection needed, just config-space exploration of
+  the `{"secs":0,"nanos":N}` shape. The `Unreachable` at `align_to_bucket_start`/`step_by` needs the
+  fork to be a *clean* signal, but even **without** the fork the panic → process exit → s6 crash-loop
+  is workload-observable (no listener / repeated restart). **Resolved here:** the answer to the
+  runtime-config-push open question is **NO**: `window_duration` is read once at build
+  (`mod.rs:92-93,194`) and there is no `ConfigChangeEvent` subscriber, so this is a **startup-only**
+  crash vector. Update the property to drop the "gRPC live-push" angle.
+
+### aggregate-clock-skew-stable — CONDITIONAL (clock-jitter fault) — PASS otherwise
+- **Topology:** needs **clock-jitter fault** ("Commonly disabled — must enable"). If unavailable, the
+  property can't run.
+- **Observability:** `zero_value_buckets.len()` and the `last_flush`/`current_time` pair are internal →
+  fork required for the crisp `Always(bounded)`/`Always(monotonic)`. Downstream flood/gap is only
+  *indirectly* visible workload-side (a spike in zero-value points at the mock intake) — a weaker proxy.
+- **Precondition:** step the container clock forward (flood) / backward (gap) *during* a flush — the
+  `Sometimes(clock jumped during flush)` anchor confirms coincidence. Forward-jump flood is easy to
+  observe (memory + point count). All tractable given the fault. **Action:** confirm clock fault
+  enabled; land the SUT-side bucket-count assertion.
+
+### ddsketch-bin-count-bounded — PASS (fork; needs to drive >4096 bins)
+- `bin_count()` is internal → fork. **Precondition:** must push a sketch past 4096 bins so `trim_left`
+  fires and the `Reachable("trim_left collapsed")` is non-vacuous; otherwise the `Always` is vacuously
+  true. That needs thousands of distinct histogram/distribution sample values per flush, which is
+  feasible via a `millstone` distribution corpus. No topology blocker (rides normal DSD load).
+
+### ddsketch-relative-error-bound — PASS as a HARNESS/library test only (NOT a live runtime check)
+- **F9 — this is not checkable as a live ADP runtime invariant; it can only be a SUT-side unit/harness
+  assertion.** The file resolves (`:104-128`) that ADP **never calls `DDSketch::quantile` on the
+  customer path** — it ships raw bins; quantiles are computed server-side. So there is no production
+  call site to anchor `Always(quantile within eps)`. The only realizable form is an in-tree test-harness
+  assertion over the agent sketch in isolation (essentially the existing proptests with SDK
+  annotations). **Action:** reframe explicitly as a library-invariant harness check (the catalog
+  already demotes it to Medium and says this); do not plan a topology/workload path for it. Merge-order
+  facet (f64 `avg`/`sum` non-associativity) is likewise only meaningfully testable harness-side.
+
+### ddsketch-no-nan-poison — CONDITIONAL — the planned producer needs a NOT-YET-BUILT gRPC feeder
+- **F10 — the only live NaN path requires a checks_ipc gRPC producer that the primary (DSD-only)
+  topology lacks and that nobody has built; without it the property is unreachable.** Confirmed this
+  session: `checks_ipc` is a **gRPC server** (`ChecksServer` on TCP :5105, `checks_ipc/mod.rs:5-13,
+  39-77`) consuming `SendCheckPayloadRequest`; the NaN bypass (`checks_ipc/mod.rs:195` → encoder
+  `insert_n`) only fires if some client sends a Histogram with a NaN value. The DSD path is
+  finiteness-gated (FloatIter), OTLP is gated, the aggregate path is DSD-only — so **no DSD workload can
+  reach the poisoning site.** deployment-topology.md:177-179 hand-waves "add a minimal checks-IPC
+  feeder or a unit-level SUT assertion." That feeder is a real build task: a gRPC client speaking
+  `datadog_protos::checks` that emits a NaN histogram. The source is enabled by
+  `dp_config.checks().enabled()` and works in standalone mode (no Core Agent needed for checks_ipc
+  itself), but the *producer* is missing. **Action:** EITHER build the checks_ipc NaN feeder (client + enable the
+  pipeline) for an end-to-end check, OR fall back to a SUT-side unit assertion at the sketch boundary
+  (`adjust_basic_stats`) exercised by an in-process test — the catalog should pick one explicitly, not
+  leave it as "or." The sketch-boundary `Always(v.is_finite())` assertion itself is sound and is the
+  right fix location.
+
+---
+
+## Category D — Lifecycle & Configuration
+
+### topology-ready-before-intake — PASS (reframed to milestone ordering)
+- The file already honestly narrows this to **readiness-milestone ordering** (sup-ready before
+  build/spawn; eventually all_ready), NOT "no bytes read before ready" — because the source binds and
+  reads gated only by backpressure. The defensible assertion is log/flag ordering (`sup_ready_ms`
+  before `topology_ready_ms`), which is workload-observable from logs even without the fork.
+  **Uncertainty (open question):** does `dsd_in` bind listeners during `initialize()` before
+  `mark_ready`? If so, a pre-ready `port_listening` probe would show binding-before-ready and the claim
+  stays at milestone ordering; if not, it can strengthen to "no socket bound pre-ready." Needs a read of
+  the DSD listener-bind vs `mark_ready` ordering to finalize assertion strength. Not a blocker; just bounds the claim.
+
+### config-stall-no-deadlock — PASS — but the busy-loop falsification target is dead
+- **Topology:** needs Add-on 1 (core-agent-stub or minimal gRPC config-stream stub) — a **build task**
+  (topology open question). Standalone mode bypasses the config stream entirely, so this property is
+  N/A without the stub.
+- **F11 — the headline "busy-loop" falsification target is unreachable through the tonic stack; the
+  real (and only) check is a quiescent-hang detector.** The file's investigation (`:120-187`) resolves
+  that a steadily-erroring stream terminates after one error and backs off 5s — no spin. So the
+  realizable check is: drop the snapshot → ADP blocks **quiescently** (low CPU) at `ready().await`
+  forever, no panic. That is checkable workload-side (CPU/log-rate monitor + "Waiting for initial
+  configuration" present, "Initial configuration received" absent). **Action:** drop the busy-loop
+  scenario from the workload; keep the quiescent-hang + flap-reconnect(5s) scenarios. The stub must be
+  able to register ADP then withhold the snapshot — confirm the stub supports that.
+
+### config-incompatible-refuses-start — PASS (workload-observable)
+- Exit code 1 + absence of `topology_ready_ms` + no data at intake — fully workload-observable; the
+  `Unreachable`-after-spawn and `Reachable`-at-refusal markers are nice-to-have fork additions but not
+  required for the core check. **Precondition:** the workload must supply a **current `Severity::High`
+  non-default key** from `config_registry/datadog/unsupported.rs`; that list drifts, so pin it to the
+  commit or source it dynamically. The key can be delivered by the Add-on 1 stub or by bootstrap YAML/env;
+  both work because `check_and_warn_config` runs on the resolved config regardless of source (confirmed
+  `run.rs:157`). So this is actually runnable **without** the stub via env/YAML config — easier than
+  the topology doc implies.
+
+### config-runtime-update-not-revalidated — CONDITIONAL — reachability depends on an unanswered product question
+- **Topology:** Add-on 1 stub, in remote-agent mode (must push a runtime `Partial`/`Snapshot` carrying
+  a high-severity key).
+- **F12 — the property's reachability hinges on a `(needs human input)` product question the catalog
+  hasn't resolved: can a `Severity::High` key actually traverse the config stream, or does the Core
+  Agent pre-filter it?** If the real Agent never emits such a key, only an adversarial stub can, which
+  makes this a "the gate doesn't exist at runtime" documentation finding rather than a falsifiable
+  property. The check itself (Reachable on the unguarded apply path, or Unreachable on running-with-key)
+  is implementable via the stub, but its *value* is gated on the product answer. **Action:** get the
+  team's answer before investing; if startup-only gating is intentional, demote to a documented gap +
+  a single `Reachable` marker.
+
+### graceful-shutdown-within-30s — PASS (conditional on fault + scope to topology)
+- SIGINT clean case (bounded load) + wedged-intake forceful case. Reachability markers on both branches
+  of `shutdown_with_timeout` (fork) or log-observable (`"All components stopped."` /
+  `"Forcefully stopping topology"`). **Caveat (file open question):** the **internal-supervisor
+  shutdown has no timeout** (`run.rs:294`), so the *process* can exceed 30s even when the *topology*
+  met it — a workload asserting "process exits within ~35s" can false-fail for an out-of-scope reason.
+  **Action:** scope the assertion to topology-shutdown completion (the log/marker inside
+  `shutdown_with_timeout`), not process exit. Forceful path needs the mock-intake hang mode (F4-adjacent).
+
+### data-component-failure-triggers-process-shutdown — PASS (temporal/log check)
+- Best realized as an Antithesis **query-logs temporal assertion**: whenever "Topology component
+  unexpectedly finished" appears, a process exit follows — workload/triage-side, no fork strictly
+  needed (a `Reachable` marker on the select arm helps). **Precondition:** induce a component finish —
+  cheapest via the sub-second-window panic (`aggregate-no-panic-any-window`) which is a guaranteed
+  deterministic finish. So this property piggybacks on C's crash target. Node-termination/kill is
+  NOT required (the component finishes on its own). Solid.
+
+---
+
+## Category E — Untrusted Input Parsing
+
+### malformed-dsd-no-crash — PASS (UDP fine here; this is the no-crash property)
+- **Transport:** explicitly the property where **UDP is appropriate** — it tests the connectionless
+  clear-and-continue path; UDP/UDS-datagram listener survival is *part of the property*. The file
+  correctly scopes "socket never dies" to connectionless and excludes TCP `break`. No loss assertion
+  here, so UDP lossiness doesn't confound.
+- **Observability:** `Always(process up)` is workload-side liveness; `framing_errors`/`*_decode_failed`
+  are existing counters for the `Sometimes` anchors. The `Unreachable` at the `unreachable!` /
+  `from_utf8_unchecked` codec sites needs the fork (a parser-regression panic is otherwise just a crash).
+  **Precondition:** SDK-RNG-generated adversarial packets across all 4 listeners — straightforward.
+  Covering UDS-datagram/stream needs the shared-volume sidecar (listener-coverage variant), an extra
+  topology piece but documented.
+
+### replay-no-panic-on-malformed-capture — PASS (instrument the REPLAY CLI process, not ADP)
+- **F13 — the panic assertion must live in the separate replay CLI process, which means a SECOND
+  instrumented binary, and the realistic panic surface is in zstd/prost deps, not ADP code.** Confirmed
+  (`reader.rs` + `dogstatsd.rs:394`): replay parses the file **in the `agent-data-plane dogstatsd
+  replay` CLI process** and forwards payloads over UDS to the running ADP. So (a) the SUT-side panic
+  hook / `assert_unreachable` belongs in the replay CLI, requiring SDK linkage in that code path too;
+  (b) the reader's own two `expect`s are bounds-guarded (not panic sites) — the real risk is a panic
+  *inside* `zstd::stream::decode_all` / `prost::decode` on adversarial input, which is harder to assert
+  on (it'd be a dep panic → SIGABRT). **Also two confirmed resource-exhaustion vectors** (unbounded
+  `fs::read`, uncapped `zstd::decode_all`) are OOM, not panic — they're a *different* observable
+  (process killed) and overlap the memory family. **Action:** instrument the replay CLI; treat OOM
+  vectors as a separate resource property or a size-cap fix; use the listener-coverage variant
+  (shared-volume `replay-client`) — no cross-container faults needed (pure input exploration). **The
+  capture files must be SDK-RNG-generated adversarial bytes** — a workload generator build task.
+
+### replay-corruption-not-silent-eof — CONDITIONAL — likely needs a format change OR stays heuristic
+- **F14 — distinguishing truncation from clean EOF may be impossible without a format change, so a
+  strict `Always` is not implementable today; only a heuristic/`Sometimes` is.** The file resolves
+  (`:89-92`): there is **no record-count or total-length field**, and the code intentionally returns
+  `Ok(None)` for truncation (tests assert it). To assert "completion was faithful" you'd need to know
+  the true record count, which the format doesn't carry. So the realizable check is a SUT-side
+  assertion at `reader.rs:95` distinguishing `size==0 && at_trailer_boundary` (clean) from
+  overrun/mid-stream (corrupt) — which requires the reader to *track* the trailer boundary it doesn't
+  today. Without that instrumentation change, the property degrades to `Sometimes(an overrunning prefix
+  was seen)` — proves the corrupt branch is reachable but NOT that completion is faithful. Plus a
+  `(needs human input)` question on whether maintainers even consider silent truncation a bug.
+  **Action:** scope to the `Sometimes(corruption-reached-the-(b)-branch)` form + the CLI-process
+  instrumentation, and flag the strict-fidelity `Always` as fix-dependent (format change needed).
+
+---
+
+## Category F — Concurrency & Boundary Conditions
+
+### interner-reclamation-no-corruption — CONDITIONAL — relies on Antithesis scheduling to hit the race
+- **F15 — the corruption branch is loom-proven-safe under the modeled interleavings; whether Antithesis
+  can construct an interleaving loom doesn't cover is unknown, so the `Sometimes(contended path hit)`
+  anchors may never fire, risking a vacuous green.** The whole value proposition is "explore beyond
+  loom's bounded model under the real scheduler," but the workload cannot *force* the
+  decrement→lock→re-check race; it can only create pressure (small interner, high-cardinality churn,
+  short-lived contexts) and hope Antithesis schedules the contended interleaving. The `Sometimes(drop
+  re-check found resurrected entry)` / `Sometimes(reclaimed-slot reused)` anchors are the guard against
+  vacuity — but if they never fire, the property neither passes meaningfully nor fails. **Observability:**
+  the corruption check (overlap or sentinel-run in a resolved `&str`) is internal → fork; note the
+  **two different sentinels** (`0x21` in map.rs, `0xAA` in fixed_size.rs) — a hard-coded check would
+  miss one impl; use direct overlap detection. **Precondition:** small interner + heap-fallback OFF so
+  reclamation is actually pressured (else strings spill to heap and reclamation is never exercised).
+  **Action:** land the overlap-based (not sentinel-hardcoded) SUT assertion + the `Sometimes` contention
+  anchors; accept that reachability of the race is Antithesis-scheduler-dependent and report the anchor
+  status in triage so a never-contended run isn't mistaken for a pass.
+
+### non-finite-values-handled-consistently — PASS (DSD facet) / shares F10 (NaN-poison facet)
+- The DSD ghost-metric facet is checkable on the primary topology: all-non-finite packets →
+  `num_points==0` gate (`mod.rs:1478`) → `Ok(None)`; the `AlwaysOrUnreachable(no zero-point metric
+  reaches aggregation)` and `Sometimes(non-finite dropped)` are anchorable (the latter needs a
+  `non_finite_dropped` counter that doesn't exist — currently only a `debug!` log, so add a counter or
+  the fork). The **NaN-at-sketch facet inherits F10** (needs the checks_ipc gRPC feeder to be live; on
+  the DSD path it's correctly Unreachable). **Action:** add the `non_finite_dropped` reachability anchor
+  (counter or SUT marker); keep the sketch-boundary `Always(is_finite)` as the producer-independent
+  assertion but understand its live exercise depends on F10's feeder.
+
+---
+
+## Summary
+
+### Findings (blockers / weakeners — Property | Concern | Scope | Evidence | Action)
+
+- **F0 | SDK not present; ~17 properties need an instrumented ADP fork before they can pass | build/topology |
+  existing-assertions.md (zero SDK); deployment-topology.md:153-161 | Make "fork + add SDK + land named
+  assertions + second image" an explicit gating milestone; run workload-only properties first.**
+- **F1 | rss-bounded-under-cardinality — no grant to assert against under defaults; `DOCKER_DD_AGENT` silently
+  flips the baseline | config | rss-bounded-under-cardinality.md:118-139; accounting.rs:37-40,107-121 | Pin
+  memory_mode+memory_limit, assert non-noop limiter, audit image for DOCKER_DD_AGENT; give container RSS
+  headroom so assertion fires before OOM-kill.**
+- **F2 | interner-full-bounded mode A — three coupled preconditions (small interner, heap-off, >31B strings) or
+  vacuous | config/precondition | interner-full-bounded.md:89-91,117-124 | Bake all three into the corpus +
+  Sometimes(try_intern==None) guard.**
+- **F3 | memory-limiter-survives-rss-read-failure — needs a custom /proc fault that may be the ONLY way to reach
+  the failure; silent thread death is unobservable without the fork | topology+fault+SUT | deployment-topology.md:144;
+  limiter.rs:54,100-102; property open Q :93-98 | Confirm tenant supports the custom fault AND the fork; else
+  park — uncheckable.**
+- **F4 | forwarder-eventual-delivery / disk-persisted-retry / shutdown-drains — UDP ingress confounds no-loss
+  reconciliation; MUST use TCP (UDS needs shared volume) | topology | deployment-topology.md:175;
+  forwarder-eventual-delivery.md:69-74 | Pin DSD ingress to TCP for these three properties.**
+- **F5 | disk-persisted-retry — silent in-memory fallback (log-only, no metric) makes it vacuous unless guarded |
+  observability | disk-persisted-retry-survives-restart.md:121-130; io.rs:405-408 | Gate the run on a
+  persistence-active assert_unreachable/log-scrape.**
+- **F6 | source-dispatch-no-misroute — misroute is structurally unreachable (extract-then-send); assertion can't
+  fire in a run | falsifiability | source-dispatch-no-misroute.md:33-52 | Keep as regression tripwire; pair with
+  the observable loss-counting sub-claim.**
+- **F7 | shutdown-drains-no-loss — the "accepted-before-signal, flushed-window" set is timing-coupled and hard to
+  construct precisely | precondition | shutdown-drains-no-loss.md:55-60 | Set flush_open_windows=true, restrict to
+  closed-window data, assert as Sometimes.**
+- **F8 | aggregate-matches-agent — faults create false diffs; the diff is only valid in a fault-paused
+  window; panoramic may not survive restart | topology/method | aggregate-matches-agent.md:89-96;
+  deployment-topology.md:117-120 | Run as a low-fault quiet-window equivalence; verify panoramic restart-tolerance first.**
+- **F9 | ddsketch-relative-error-bound — no live runtime call site (ADP ships raw bins, no quantile); only a
+  library/harness test | observability | ddsketch-relative-error-bound.md:104-128 | Reframe as in-tree harness
+  assertion; no topology/workload path.**
+- **F10 | ddsketch-no-nan-poison (and NaN facet of non-finite) — the only live NaN path needs a checks_ipc gRPC NaN
+  feeder not in the primary topology and not yet built | topology/build | checks_ipc/mod.rs:5-13,39-77,195;
+  run.rs:446; deployment-topology.md:177-179 | Build the gRPC checks feeder OR fall back to a SUT-unit
+  sketch-boundary assertion — pick one explicitly.**
+- **F11 | config-stall-no-deadlock — busy-loop falsification target is unreachable through tonic; needs a
+  config-stream stub | method/topology | config-stall-no-deadlock.md:120-187 | Drop busy-loop scenario; keep
+  quiescent-hang; confirm stub can register-then-withhold-snapshot.**
+- **F12 | config-runtime-update-not-revalidated — reachability gated on an unresolved product question (can a
+  High-severity key traverse the stream?) | product input | config-runtime-update-not-revalidated.md:42-47 | Get
+  team answer before investing; else demote to documented gap + Reachable marker.**
+- **F13 | replay-no-panic — assertion must live in the SEPARATE replay CLI process (second instrumented binary);
+  real panic surface is in zstd/prost deps; +2 OOM vectors | topology/build |
+  replay-no-panic-on-malformed-capture.md:83-130; reader.rs:40-44; dogstatsd.rs:394 | Instrument the replay CLI;
+  SDK-RNG capture generator; treat OOM vectors separately.**
+- **F14 | replay-corruption-not-silent-eof — strict fidelity Always needs a format change (no record count exists);
+  only a heuristic Sometimes is implementable today | format limitation | replay-corruption-not-silent-eof.md:89-92 |
+  Scope to Sometimes(corrupt branch reached) + CLI instrumentation; flag strict Always as fix-dependent.**
+- **F15 | interner-reclamation-no-corruption — race is loom-safe; hitting an un-modeled interleaving depends on the
+  Antithesis scheduler; Sometimes anchors may never fire (vacuous) | scheduler-dependence |
+  interner-reclamation-no-corruption.md:55-128 | Use overlap-based (not sentinel-hardcoded) assertion; heap-off +
+  small interner; report anchor status so a never-contended run isn't read as a pass.**
+
+### Passes (implementable as planned, modulo the SDK fork in F0 and any noted scoping)
+
+- aggregate-context-limit-enforced (cleanest SUT Always; counter-anchored Sometimes).
+- aggregate-no-panic-any-window (config-only crash; resolved: startup-only, drop the runtime-push angle).
+- ddsketch-bin-count-bounded (fork + drive >4096 bins).
+- no-silent-interconnect-drop (counter-readable; scope to always-wired edges).
+- retry-queue-bounded-under-outage (split observability; mock-intake failure-mode toggle needed).
+- config-incompatible-refuses-start (workload-observable via exit code; runnable even without the stub via env/YAML).
+- topology-ready-before-intake (reframed to milestone ordering; log-observable).
+- graceful-shutdown-within-30s (scope to topology shutdown, not process exit).
+- data-component-failure-triggers-process-shutdown (log/temporal check; piggybacks on the C crash target).
+- malformed-dsd-no-crash (UDP appropriate here; counters + liveness; fork for codec Unreachable).
+- aggregate-clock-skew-stable (needs clock fault enabled; otherwise solid).
+- non-finite-values-handled-consistently (DSD ghost-metric facet; needs a non_finite_dropped anchor).
+
+### Uncertainties (need an answer to finalize the check)
+
+- Does `dsd_in` bind listeners during `initialize()` before `mark_ready`? Decides whether
+  topology-ready-before-intake can strengthen to "no socket bound pre-ready" (file open Q).
+- Is any production DSD output (`dsd_debug_log_out`/`dsd_stats_out`) ever conditionally unwired? Decides
+  the scope of no-silent-interconnect-drop's `Always(delta==0)` (file open Q).
+- Can `process_memory::Querier::resident_set_size()` fail post-startup on the Antithesis Linux target
+  without the custom fault? Pivotal for F3's priority (file open Q).
+- Can a `Severity::High` config key actually traverse the Core Agent → ADP config stream? Pivotal for
+  F12's value (file open Q, needs human input).
+- Does the mock `datadog-intake` binary support a runtime-toggleable reject/5xx/slow/hang mode, or does it
+  need extension? Needed by retry-queue, forwarder-eventual-delivery, shutdown forceful-path (topology open Q).
+- Which faults are enabled on the target tenant (node-termination, clock-jitter both "commonly disabled",
+  custom /proc fault)? Several CONDITIONAL properties can't run if these are off (topology fault table).
+- Can `panoramic` survive an ADP process restart mid-run? Decides whether aggregate-matches-agent's
+  restart-equivalence facet is testable at all (file open Q).
diff --git a/test/antithesis/scratchbook/evaluation/synthesis.md b/test/antithesis/scratchbook/evaluation/synthesis.md
new file mode 100644
index 00000000000..2a47198f0f8
--- /dev/null
+++ b/test/antithesis/scratchbook/evaluation/synthesis.md
@@ -0,0 +1,141 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space — design partner focus on the tag-filter RC relay shaped the bias findings.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP
+    why: Confirms runtime filter config-reload is the design-partner's documented test focus.
+  - path: https://github.com/DataDog/saluki/pull/1768
+    why: PR review #4393897611 (Copilot) — the G2 filter-deletion wording and three priority alignments reconciled here.
+---
+
+# Property Evaluation Synthesis
+
+Four evaluation lenses (Antithesis-fit, coverage-balance, implementability, wildcard) stress-tested
+the 27-property catalog as a portfolio. Findings categorized below as **Refinement** (applied
+directly), **Gap** (filled via targeted discovery), or **Bias** (escalated to the user). Lens
+evidence: `evaluation/{antithesis-fit,coverage-balance,implementability,wildcard}.md`.
+
+Outcome: 8 properties added (catalog 27 → 35), 9 refinements applied, 1 scope bias escalated.
+
+## Gaps (filled)
+
+### G1 — Events & service-checks paths uncovered (always-on) → 3 properties
+Coverage-balance F1, wildcard. When DogStatsD is enabled, `dsd_in.events`/`dsd_in.service_checks`
+are always-wired production paths (`run.rs:681-684`) with their own ~394/~312-LOC codecs, yet the
+27-property catalog was metrics-only. **Filled** with `events-sc-no-silent-loss`,
+`malformed-event-sc-no-crash`, `events-sc-pipeline-reachable` (the last is an anti-vacuity anchor so
+a metrics-dominated workload can't pass the first two trivially).
+
+### G2 — ADP-as-transformer correctness + runtime filter config-reload → 5 properties
+Coverage-balance F4/F5/F10, wildcard W1/W2/W6, the design-partner focus. The catalog covered ADP as
+a *transport* but not the mapping/filtering/enrichment *correctness* layer, and treated runtime
+config only as a crash gate — never as a data-correctness event, despite the watcher hazards
+(`broadcast::Lagged` silent drop → stale filtering, `watcher.rs:36-74`; partial-deserialize
+half-apply; key-deletion leaves filtering silently **stale** because the additive `diff_recursive`
+emits no change event, `diff.rs:12-48`, while only an explicit empty/null value **clears** it,
+`tag_filterlist/mod.rs:274-276`) and the design partner's documented "Tag Filter RC Relay Stress
+Test." (Deletion-is-stale vs. explicit-empty-clears is the distinction detailed in
+`filter-config-reload-correct.md` Hazard 3.) **Filled** with `mapper-output-matches-agent`, `mapper-interner-bounded`
+(a *second* bounded interner with its own silent drop), `filter-config-reload-correct` (the watcher
+hazards on live data), `tag-filterlist-applied-consistently`, `prefix-filter-ordering-matches-agent`
+(bug-history-sensitive stage ordering). These need the config-stream add-on topology (not standalone)
+and the diff-test add-on; noted in each evidence file.
+
+### Gaps NOT filled (folded or escalated)
+- **API-key rotation mid-run** (coverage F9): folded as a fault dimension into
+  `disk-persisted-retry-survives-restart` / `forwarder-eventual-delivery` rather than a new property.
+- **Internal-supervisor restartability** (coverage F8): noted as a minor gap; low priority relative
+  to the data path. Left for a future pass.
+- **Traces/APM, logs, OTLP pipelines** (coverage F2/F3, wildcard): escalated as the scope **Bias**
+  below rather than filled — they are the "broader topology, lower priority" the user deferred.
+
+## Biases (escalated to user)
+
+### B1 — Catalog (and SUT analysis) is framed around metrics-DogStatsD transport
+Wildcard W4/W6, coverage F2/F3, multiple uncertainties. Even after gap-filling, the catalog is
+DogStatsD-metrics-centric. The **traces/APM, logs, and OTLP pipelines** (`run.rs:506-591,700-758`) —
+including a SQL-parsing trace-obfuscation untrusted-input surface and a *second* OTLP forwarder — have
+**zero properties** and are absent from the SUT analysis. Whether they are in scope for first-customer
+(Agent 7.80.0) delivery is a product judgment the evaluation can't make. The primary topology
+also uses **standalone mode**, which structurally hides the entire runtime-config surface (the
+watcher never fires) — so the G2 config-reload properties pass vacuously unless the config-stream
+topology is promoted to primary. **Escalated** — see the questions posed to the user. This does not
+block the catalog; it scopes which add-ons and pipelines get instrumented first.
+
+## Refinements (applied)
+
+- **R1 (catalog-wide, important):** the container's s6 supervisor auto-restarts ADP on exit, so
+  "process up" workload assertions are vacuously green even during a crash-loop. **Every no-crash
+  property must assert SUT-side `Unreachable` at panic sites (or on restart-count), never container
+  liveness.** Applied as a catalog-wide note and reflected in `malformed-dsd-no-crash`,
+  `malformed-event-sc-no-crash`, `data-component-failure-triggers-process-shutdown`.
+- **R2 (catalog-wide; updated 2026-05-30):** the Antithesis Rust SDK is now wired into ADP behind the
+  `antithesis` cargo feature (`bin/agent-data-plane/Cargo.toml`) with an `antithesis_init()` +
+  bootstrap `assert_reachable!` probe, and the harness binaries carry workload-side anchors — see
+  `existing-assertions.md`. The **"fork ADP + add SDK + build an instrumented image"** prerequisite is
+  therefore largely satisfied (the wiring is proven end-to-end); what remains is implementing the ~17
+  in-process SUT-side **property** assertions on top of that scaffold. The ~10 workload-only
+  properties can still run first.
+- **R3 (catalog-wide):** no-loss properties (`no-silent-interconnect-drop`, `forwarder-eventual-delivery`,
+  `disk-persisted-retry-survives-restart`, `shutdown-drains-no-loss`, `events-sc-no-silent-loss`)
+  **must use TCP (or UDS), not UDP**, or UDP's inherent loss confounds "accepted == delivered." Noted
+  in the topology and those properties.
+- **R4 (catalog-wide vacuity):** safety properties gated by hard-to-reach `Sometimes` anchors
+  (bin-collapse, interner-resurrection race, events/SC reachability) must have the workload force the
+  anchor config/corpus; the run synthesizer must report **unreached `Sometimes` as inconclusive, not
+  passing**. Added to catalog-wide notes.
+- **R5 `ddsketch-relative-error-bound`:** demoted — it is a **library/proptest invariant, not a live
+  ADP runtime assertion** (ADP ships raw bins, never calls `DDSketch::quantile` on the customer path).
+  Re-scoped to harness-side; priority Medium→Low. (Applied during open-question sync; reaffirmed.)
+- **R6 `ddsketch-bin-count-bounded`:** demoted High→Medium — substantially duplicates existing
+  proptests; genuine Antithesis value is only a live regression tripwire for a future mutator that
+  forgets `trim_left`. The `Reachable(collapse)` anchor is essential or the `Always` is vacuous.
+- **R7 `config-incompatible-refuses-start`:** demoted High→Medium — a deterministic ordered gate
+  already covered by the integration suite's config-check-exit-code cases; kept as cheap config
+  exploration with the `Reachable(refused)` framing (the `Unreachable` is statically unreachable).
+- **R8 `source-dispatch-no-misroute`:** re-centered on the **live silent-loss facet** (dispatch
+  failure loses events with no/under-counting) rather than the structurally-vacuous misroute
+  `Unreachable`; priority kept Medium, framing corrected.
+- **R9 `memory-limiter-survives-rss-read-failure`:** priority noted as **High *conditional* on a
+  scriptable `/proc` custom fault + limiter enabled**; otherwise it is unreachable. Framing clarified.
+- **R10 (de-dup labelling):** marked shared-scenario pairs in `property-relationships.md`
+  (`shutdown-drains-no-loss` ⇄ `graceful-shutdown-within-30s`; `non-finite-values-handled-consistently`
+  ⇄ `ddsketch-no-nan-poison`) so the portfolio count isn't read as 35 independent test efforts.
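+
+The R4 reporting rule can be sketched as a small verdict function. This is a minimal illustration
+only: `PropertyRun`, `Verdict`, and `verdict` are hypothetical names, not part of the Antithesis SDK
+or the run synthesizer, and the two counters are assumed to be available per property.
+
+```rust
+/// Hypothetical per-property summary of one run (illustrative, not an SDK type).
+struct PropertyRun {
+    /// Times the guarded `Always` condition evaluated to false.
+    always_violations: u64,
+    /// Times the `Sometimes` anchor condition evaluated to true.
+    anchor_hits: u64,
+}
+
+#[derive(Debug, PartialEq)]
+enum Verdict {
+    Pass,
+    Fail,
+    Inconclusive,
+}
+
+/// A safety property passes only if its anchor was actually reached.
+fn verdict(run: &PropertyRun) -> Verdict {
+    if run.always_violations > 0 {
+        Verdict::Fail
+    } else if run.anchor_hits == 0 {
+        // The interesting state was never reached, so the `Always` held vacuously.
+        Verdict::Inconclusive
+    } else {
+        Verdict::Pass
+    }
+}
+```
+
+Under this sketch a run with zero violations and zero anchor hits is reported as `Inconclusive`
+rather than `Pass`, which is the distinction R4 asks the synthesizer to preserve.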
+
+## Passes (lenses confirmed sound)
+
+- Category A memory bounds — well-proportioned to the highest-risk surface; each property maps to a
+  distinct mechanism.
+- Category B forwarder cluster (eventual-delivery / byte-cap / crash-durability) — strong, non-redundant.
+- The zero-fault clock crash finding `aggregate-clock-skew-stable` (forward-jump facet) — correctly
+  prioritized, cheap, high-value (F1 note: clock vector, not a runtime-discoverable state). Its sibling
+  `aggregate-no-panic-any-window` had its `% 0` panic vector **closed upstream** (window is now
+  `NonZeroU64`, PR #1772, fc4bb297); it is demoted from a live crash bug to a cheap `Unreachable`
+  regression tripwire — see the catalog status note and the bug ledger.
+- `ddsketch-no-nan-poison` checks_ipc bypass — genuine live latent bug, correctly High.
+- Type mix (~safety-heavy with 6+ liveness) appropriate for a no-crash/no-corruption SUT; reachability
+  used correctly as anti-vacuity riders.
+- `host_enrichment` static (correctly no property); mapper uses the backtracking-free `regex` crate
+  (regex-DoS is a non-issue — recorded closed); OTTL panics are traces-only and test-gated.
+
+## Open uncertainties carried forward (need team input)
+
+- Are traces/APM, logs, OTLP in scope for first-customer delivery? (B1)
+- Does the `millstone` corpus exercise events/SC, mapped/filtered metrics, and adversarial histogram
+  values — i.e. does `aggregate-matches-agent` implicitly cover some G1/G2 ground?
+- Can a `Severity::High` config key, or a filter-config update, actually traverse the RC stream at
+  runtime (vs. Core-Agent pre-filtering)? Gates the reachability of the config-reload properties.
+- ~~Which faults are enabled for the tenant~~ **Resolved (user, 2026-05-28):** node termination,
+  clock jitter, and custom `/proc` faults are all enabled — the crash-recovery, clock-skew, and
+  limiter-RSS-failure properties are realizable.
+- Does `datadog-intake` support a runtime failure-mode toggle (reject/5xx/slow/hang)?
+
+## Scope decision (user, 2026-05-28)
+
+The traces/APM, logs, and OTLP pipelines are **deferred** (documented exclusion in the catalog), not
+filled. DogStatsD metrics + events/service-checks + the runtime config/transform surface is the
+first-customer scope. Bias B1 is thereby resolved-as-accepted: the catalog is intentionally scoped,
+not accidentally narrow.
diff --git a/test/antithesis/scratchbook/evaluation/wildcard.md b/test/antithesis/scratchbook/evaluation/wildcard.md
new file mode 100644
index 00000000000..33d28cf872f
--- /dev/null
+++ b/test/antithesis/scratchbook/evaluation/wildcard.md
@@ -0,0 +1,271 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP
+    why: Team-authored stress-test spec for the runtime tag-filter RC relay — the exact surface the catalog under-covers.
+---
+
+# WILDCARD evaluation — ADP property catalog
+
+Bias: find what Lens 1 (Antithesis-Fit), Lens 2 (Coverage-Balance), Lens 3 (Implementability)
+miss. The three lenses accept the SUT analysis's framing, in which ADP is essentially a
+*transport* — its job is to not crash and not lose bytes. That framing is the catalog's blind
+spot. ADP is also a *transformer*: it maps, filters (twice), tag-filters, enriches, and
+aggregates customer data, and most of that transformation is driven by **runtime config that
+mutates while data flows**. The catalog has 27 properties; ~24 are about crash/memory/loss and
+3 (`aggregate-matches-agent`, `aggregate-clock-skew-stable`, the two ddsketch math props) are
+about correctness of values. **Zero** properties cover the map/filter/tag-filter/enrichment
+layer's correctness, and zero cover runtime config-reload as a *data-correctness* event rather
+than a crash/incompatibility event. That is the headline wildcard finding.
+
+---
+
+## FINDING W1 — The runtime config-reload-while-data-flows surface is almost entirely uncovered (catalog-wide / framing miss)
+
+**Concern.** Five production components subscribe to live config updates over the Core Agent RC
+stream and **rebuild correctness-affecting state in place while metrics are flowing through them**:
+
+- `dogstatsd_prefix_filter` — watches 4 keys (`metric_filterlist`, `metric_filterlist_match_prefix`,
+  `statsd_metric_blocklist`, `statsd_metric_blocklist_match_prefix`) and rebuilds the allow/block
+  filter live (`bin/agent-data-plane/src/components/dogstatsd_prefix_filter/mod.rs:285-330`).
+- `dogstatsd_post_aggregate_filter` — same 4 keys
+  (`.../dogstatsd_post_aggregate_filter/mod.rs:268-308`).
+- `tag_filterlist` — watches `metric_tag_filterlist`, and on change does
+  `self.filters = compile_filters(...); self.context_cache = build_context_cache();`
+  (`.../tag_filterlist/mod.rs:222, 274-278`).
+- `dsd_debug_log` (stats enable) and `internal/logging` (`log_level`) — lower stakes.
+
+The catalog's only two runtime-config properties — `config-incompatible-refuses-start` and
+`config-runtime-update-not-revalidated` — treat config purely as a **crash / incompatibility-gate
+safety** concern (does an incompatible *key* get applied?). Neither asks the question that the
+team's own Confluence page ("Tag Filter RC Relay Stress Test — agent + ADP") is built around:
+**after a filter/tag config update lands, is the data that gets through actually filtered
+correctly, and is that true under load + interleaving + fault?** This is the single biggest
+framing gap. It is squarely in Antithesis's sweet spot (timing of the config-stream event vs the
+data-flow event is exactly what a deterministic diff-test cannot explore) and it is a
+*data-correctness* failure, not a crash — invisible to every existing process-level assertion.
+
+**Concrete hazards inside this surface that no property names:**
+
+1. **`broadcast::Lagged` drops config updates silently → stale filtering, no recovery until the
+   next update.** The watcher reads config over a `tokio::sync::broadcast` channel; on `Lagged` it
+   logs a warn and `continue`s without re-reading the missed value
+   (`lib/saluki-config/src/dynamic/watcher.rs:60-67`). A transform whose `select!` loop is busy
+   draining a full event channel under load (backpressure) can lag the broadcast and **miss a
+   filter update entirely**, then run with stale filters indefinitely. This is a
+   backpressure × config-reload interaction (a directive-#2 combination) that produces wrong
+   filtering output, not a crash. No lens models it.
+2. **Partial-deserialize skip → half-applied multi-key config.** A malformed/wrong-typed value
+   for one watched key is skipped with a warn while the other keys apply
+   (`watcher.rs:43-56`). The prefix/post-agg filters watch 4 interdependent keys; a new
+   `metric_filterlist` can apply while its companion `metric_filterlist_match_prefix` is rejected,
+   leaving the filter in an inconsistent (new-list / old-match-mode) state — silently wrong
+   filtering semantics.
+3. **Key-deletion silently clears all filtering.** `tag_filterlist` does
+   `new_entries.as_deref().unwrap_or(&[])` (`mod.rs:275`): an RC update that *removes* the key
+   yields `None` → empty filter set → all tag filtering silently turned off. Correctness loss with
+   no signal.
+4. **Cache coherency across swap.** `tag_filterlist` rebuilds `context_cache` on swap (good), but
+   the swap and the in-flight event batch are processed in the same `select!` loop — a property
+   should pin down that no metric in the post-swap batch is filtered against a stale cache entry.
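+
+Hazard 3 above can be shown in isolation. A minimal sketch, assuming the watcher hands the
+transform an `Option<Vec<String>>`: only the `as_deref().unwrap_or(&[])` expression comes from the
+cited line, while `effective_filters` and `effective_filters_guarded` are hypothetical helpers
+written for illustration.
+
+```rust
+/// Hypothetical stand-in for the value passed on to filter compilation.
+/// `None` models an RC update that removed the key entirely.
+fn effective_filters(new_entries: &Option<Vec<String>>) -> &[String] {
+    new_entries.as_deref().unwrap_or(&[])
+}
+
+/// One possible guard (an assumption, not current behavior): keep the
+/// previous filters when the key disappears instead of clearing them.
+fn effective_filters_guarded<'a>(
+    new_entries: &'a Option<Vec<String>>,
+    previous: &'a [String],
+) -> &'a [String] {
+    new_entries.as_deref().unwrap_or(previous)
+}
+```
+
+With the unguarded form, a removed key and an explicitly empty list both yield an empty slice, so
+the transform cannot tell "operator cleared the filters" from "the key vanished". Whether the
+guarded form is the desired semantics is a question for the team.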
+
+**Scope.** Production DSD hot path (prefix filter, tag filterlist, post-agg filter all sit
+between `dsd_in.metrics` and `dd_metrics_encode`, `run.rs:674-679`). Requires Add-on 1 (config
+stream) — these are *not* exercised at all in standalone-mode primary topology (see W4).
+
+**Evidence brief.** `dogstatsd_prefix_filter/mod.rs:285-330`,
+`dogstatsd_post_aggregate_filter/mod.rs:268-308`, `tag_filterlist/mod.rs:222,274-278`,
+`saluki-config/src/dynamic/watcher.rs:43-67`. Confluence "Tag Filter RC Relay Stress Test"
+(referenced in `deployment-topology.md:8`) — the team has already scoped a stress test here that
+the catalog does not mirror.
+
+**Suggested action.** Add a property family `config-filter-reload-correct` (suggest 1-2
+properties): (a) `Always`(after a filter-config update is observed, every subsequently-emitted
+metric reflects the *latest* applied filter — no metric retains a tag the new denylist removes /
+no blocked metric passes); (b) `Sometimes`(config update applied under concurrent load) +
+`Reachable`(broadcast Lagged path) to prove the stale-config branch is reachable. Differential
+formulation (Agent + ADP both receive the same RC update mid-stream, diff output after a quiet
+window) reuses the `aggregate-matches-agent` harness and is the most feasible check. At minimum,
+elevate the watcher's `Lagged`/partial-deserialize/key-deletion behaviors to first-class hazards
+in `config-runtime-update-not-revalidated`, which today scopes them out.
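+
+Clause (a) of the suggested property could be checked workload-side roughly as follows. This is a
+sketch under stated assumptions: the `Event` stream and `stale_filter_violations` are hypothetical,
+the checker is assumed to see events in the order the filter processed them, and only the tag
+denylist case is modeled.
+
+```rust
+use std::collections::HashSet;
+
+/// Hypothetical observation stream, in the order the filter processed events.
+enum Event {
+    /// A tag denylist update was observed as applied.
+    DenylistApplied(Vec<String>),
+    /// A metric left the filter carrying these tag keys.
+    MetricEmitted { name: String, tags: Vec<String> },
+}
+
+/// Returns the names of emitted metrics that still carry a tag key
+/// removed by the most recently applied denylist.
+fn stale_filter_violations(events: &[Event]) -> Vec<String> {
+    let mut denied: HashSet<&str> = HashSet::new();
+    let mut violations = Vec::new();
+    for event in events {
+        match event {
+            Event::DenylistApplied(keys) => {
+                // The latest update fully replaces the previous denylist.
+                denied = keys.iter().map(String::as_str).collect();
+            }
+            Event::MetricEmitted { name, tags } => {
+                if tags.iter().any(|t| denied.contains(t.as_str())) {
+                    violations.push(name.clone());
+                }
+            }
+        }
+    }
+    violations
+}
+```
+
+A non-empty result would be the `Always` violation. The stale-config case from hazard 1 would show
+up here as metrics emitted after `DenylistApplied` that still carry a denied key, provided the
+"applied" observation is taken from the config source rather than from the lagging transform.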
+
+---
+
+## FINDING W2 — The DogStatsD mapper is a self-contained correctness + resource surface with no property (slug: none; nearest `aggregate-matches-agent`)
+
+**Concern.** `dogstatsd_mapper` (`lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs`)
+sits first on the DSD metrics path (`dsd_enrich` chains it, `run.rs:639-641,674`). It:
+- compiles user-supplied regexes and does capture-group expansion into new metric **names** and
+  new **tag values** (`try_map`, `mod.rs:297-314`) — a rich correctness surface (wrong capture
+  expansion = silently renamed metric / wrong tag, customer-visible data corruption);
+- has its **own result cache** keyed by metric name (`mod.rs:269-285`) and its **own string
+  interner** separate from the aggregate interner (`context_string_interner_size`, default 64KiB);
+- silently drops the metric when its interner is full: `resolve_with_origin_tags(...)?` on both the
+  cache-hit path (`mod.rs:277-282`) and slow path (`mod.rs:318`) returns `None`, and `try_map`'s
+  `None` means the original (pre-map) context is used — but a `None` from the cache-hit branch
+  returns `None` from `try_map` with no telemetry. This is a *second*, mapper-local interner
+  exhaustion path that the catalog's `interner-full-bounded` (scoped to the aggregate/context
+  interner) does not cover.
+
+This is exactly the "correctness of the data that gets through" facet directive #1 flagged. Regex
+capture expansion and a name-keyed cache are classic silent-wrong-data bug shapes; ADP claims
+Agent-equivalence for the mapper too, and nothing tests it under fault or even names it.
+
+**Scope.** Production DSD hot path, first transform. Memory facet overlaps Category A but with a
+distinct interner instance.
+
+**Evidence brief.** `dogstatsd_mapper/mod.rs:31-32` (wildcard match regex), `186-211`
+(`build_regex`, regex compiled from `*`→`([^.]*)`), `258-345` (`try_map`, cache + capture
+expansion + `?`-drop), separate interner at `103-165`.
+
+**Suggested action.** Add `mapper-output-matches-agent-and-bounded` (or fold into a broadened
+transform-correctness differential): assert mapped names/tags match the Agent's mapper for the
+same input + profiles; `Always`(mapper interner full ⇒ counted drop, not silent / not heap-spill
+depending on config); `Sometimes(mapper_interner_full)`. Also worth a cheap regex-DoS angle:
+user-supplied `regex`-type mappings are compiled with the `regex` crate (no catastrophic
+backtracking, good) — confirm and note as a *pass* so it isn't re-flagged.
+
+---
+
+## FINDING W3 — Replay-then-aggregate timestamp divergence is a correctness hazard mentioned in the SUT analysis but has no property
+
+**Concern.** SUT analysis §7.9 explicitly states: "a replayed capture buckets differently than
+when captured (the aggregator ignores per-record timestamps for non-timestamped metrics)." The
+new replay feature (`e88d04b10a`, the most regression-prone area per §8) re-injects captured
+DSD records through the live socket; non-timestamped metrics are then bucketed by **current wall
+clock at replay time**, not capture time. The catalog has two replay properties
+(`replay-no-panic-on-malformed-capture`, `replay-corruption-not-silent-eof`), both about
+*crash / parse fidelity*, neither about *aggregation fidelity of replayed data*. Given replay is the
+newest, largest, riskiest feature shipping for the first customers, "does replayed data aggregate to
+the same result as the original" is a correctness question the catalog skips.
+
+**Scope.** Replay CLI → DSD socket → aggregate. Listener-coverage variant topology.
+
+**Evidence brief.** SUT analysis §7.9; replay re-injection via DSD UDS
+(`sources/dogstatsd/replay/`, replay CLI `dogstatsd.rs:394`); aggregate wall-clock bucketing
+(`aggregate-clock-skew-stable` evidence).
+
+**Suggested action.** Either (a) add a note/property that replay fidelity for non-timestamped
+metrics is *by design* lossy w.r.t. bucketing (document, assert nothing) — needs human input on
+intent; or (b) if capture preserves timestamps, assert replayed aggregation matches original
+within a stated tolerance. Flag for the team; do not over-engineer until intent is confirmed.
+
+---
+
+## FINDING W4 — Standalone-mode primary topology structurally hides the entire control-plane → data-plane config surface (catalog-wide / topology)
+
+**Concern.** `deployment-topology.md` runs ~22/27 properties in **standalone mode**, which
+"bypasses the remote-agent config stream" (topology doc, Add-on 1). The doc's own open question —
+"Confirm no standalone-only code path masks a production behavior we care about" — is answered by
+W1/W2: standalone mode masks the *entire runtime-config-driven filtering/mapping correctness
+surface*, because in standalone there is no RC stream to deliver filter/tag updates, so the
+`watch_for_updates` branches in all five components **never fire**. Any property that doesn't
+force Add-on 1 will *vacuously pass* on this surface. The catalog buries the config stream in an
+"Add-on" for 3 properties; in production, RC-driven filtering is a primary, always-on behavior
+(the design partner's whole tag-filter relay use case). The topology under-weights it.
+
+**Scope.** Whole config-stream cluster + W1 + W2.
+
+**Evidence brief.** `deployment-topology.md:43,78-101,166-168`; `watcher.rs:29-32`
+(`if self.rx.is_none() { pending_forever }` — in standalone the watcher future never resolves).
+
+**Suggested action.** Promote the config-stream add-on to a co-equal primary topology (or make
+the stub mandatory), and route W1's new properties through it. Note explicitly in the catalog
+that filter/tag/mapper-reload properties are vacuous in standalone mode.
+
+---
+
+## FINDING W5 — Duplicate / over-overlapping properties (catalog hygiene)
+
+Confirmed overlaps the other lenses should reconcile:
+
+1. **`shutdown-drains-no-loss` vs `graceful-shutdown-within-30s`** — the catalog already
+   acknowledges the split ("one owns *what data survives*, the other owns *clean completion in
+   time*"), and the split is defensible, but both assert against the same 30s-timeout boundary and
+   the same forceful-stop path with the same fault setup. Risk: they will be implemented as one
+   instrumented run with two assertions, so counting them as two "properties" inflates the
+   portfolio. Keep both assertions, but treat as one test scenario.
+2. **`data-component-failure-triggers-process-shutdown`** overlaps the forceful-stop clause of
+   `graceful-shutdown-within-30s` and the crash trigger of `aggregate-no-panic-any-window` /
+   `aggregate-clock-skew-stable` (those panics are *how* you induce the component failure). These
+   are three properties sharing one mechanism (component dies → process exits → s6 restarts).
+   Fine to keep, but the "induce a panic" half is the same event.
+3. **`non-finite-values-handled-consistently` vs `ddsketch-no-nan-poison`** — `non-finite` is
+   largely a superset: its invariant *is* `Always(value.is_finite() at DDSketch insert boundary)`,
+   which is `ddsketch-no-nan-poison`'s core, plus the ghost-metric clause. The catalog demotes
+   `non-finite` to Medium and keeps `ddsketch-no-nan-poison` High with the live `checks_ipc`
+   bypass as justification. Reasonable, but the assertion site is *identical*
+   (`adjust_basic_stats`/`insert*`); these should be one SUT-side assertion with two reachability
+   anchors, not two separately-instrumented properties.
+4. **`ddsketch-relative-error-bound` vs `ddsketch-bin-count-bounded`** — `relative-error` is
+   already demoted to a library/harness invariant (ADP doesn't call `quantile` live) and
+   `bin-count-bounded` owns the live facet. Borderline whether `relative-error` belongs in an
+   *Antithesis* catalog at all (it's a pure proptest target with existing proptests) — Lens 1
+   territory, flagging for cross-check.
+
+**Suggested action.** Mark 1-3 as "shared-scenario" pairs so the portfolio count isn't read as
+independent coverage; let Lens 1 rule on whether `ddsketch-relative-error-bound` is
+Antithesis-appropriate at all.
+
+---
+
+## FINDING W6 — Mis-prioritization given the 7.80.0 ship context
+
+**Concern.** The ship context is: first customers, *design partner*, whose documented
+interest (Confluence) is the **tag-filter RC relay**. The catalog's two highest-effort,
+highest-visibility High items are `aggregate-matches-agent` (heaviest topology, own run) and the
+memory-bounds family (much of which is "expected to FAIL by design under default config" — i.e.
+known limitations, not regressions). Meanwhile the design partner's actual feature — runtime tag
+filtering correctness (W1) — has no property. Relative to ship context, W1 should arguably be a
+High before some of the "fails-by-design" memory items, which document known gaps the team
+already understands rather than surfacing surprises.
+
+**Suggested action.** Re-rank: W1 (filter-reload correctness) to High; keep the two
+guaranteed-crash config/clock findings High (cheap, real, ship-blocking). Consider that several Category-A
+"fails by design" properties are really *documentation of a known limitation* and could be Medium
+unless the team intends to flip defaults before 7.80.0.
+
+---
+
+## PASSES (things the catalog/lenses got right; do not re-flag)
+
+- The two guaranteed-crash findings (`aggregate-no-panic-any-window` sub-second window
+  divide-by-zero; `aggregate-clock-skew-stable` forward-jump flood) were correctly High, cheap, and real —
+  verified the code shapes matched the catalog claims. _Update 2026-05-30 — the sub-second
+  divide-by-zero is now **fixed upstream** (window typed `NonZeroU64`, PR #1772) and demoted to a
+  regression tripwire; the forward-jump flood remains live._
+- `ddsketch-no-nan-poison`'s live `checks_ipc` bypass is a genuine, correctly-prioritized latent
+  bug.
+- The default-config-is-hostile framing for Category A is accurate and well-evidenced.
+- `host_enrichment` is static (hostname queried once at build, `host_enrichment/mod.rs:67-75`) —
+  **not** a runtime-mutable correctness surface; correctly *absent* from the catalog. Don't add it.
+- OTTL filter/transform processors are wired into the **traces** path only
+  (`run.rs:561-567`), not the DSD metrics hot path, and their panic sites are `#[cfg(test)]` —
+  correctly out of scope for the DSD-focused topology. (Note: if a traces topology is ever added,
+  OTTL untrusted-config parsing becomes a parse-safety surface like the replay reader.)
+- The mapper compiles user regex via the `regex` crate (no catastrophic backtracking by
+  construction) — the regex-DoS angle is a non-issue; record as closed.
+
+## UNCERTAINTIES (report-what's-odd; could not fully resolve)
+
+- **Is RC-stream filter update even reachable from the Core Agent for these keys?** Same open
+  question the catalog raises for `config-runtime-update-not-revalidated` ("can a High-severity key
+  be delivered, or does the Agent pre-filter?"). If the Agent *does* push `metric_tag_filterlist` /
+  `metric_filterlist` updates at runtime (the relay use case implies yes), W1 is High-value
+  and live; if these are startup-only in practice, W1 collapses to a code-review note. Needs
+  human/team input — pivotal for W1's priority.
+- **Does `tag_filterlist` filtering only Counter + sketch metrics (`mod.rs:235-237`) match the
+  Agent?** Gauges/rates appear to pass through tag-filtering untouched. Could be intentional
+  (sketches/counters are the cardinality drivers) or a correctness gap vs the Agent. Odd enough to
+  flag; no property would catch it today.
+- **broadcast channel depth for config events** — could not determine the `broadcast::channel`
+  capacity that feeds `FieldUpdateWatcher`; how easily `Lagged` triggers under load (W1 hazard 1)
+  depends on it. If depth is large, the stale-config window is narrow; if 1-16, it's very
+  reachable under backpressure. Worth a one-line code check before sizing the W1 workload.
+- **Whether the three "shared-scenario" property pairs (W5) are double-counted in the
+  coverage-balance portfolio math** — Lens 2's distribution counts may overstate independent
+  coverage by ~3-4 properties.
diff --git a/test/antithesis/scratchbook/existing-assertions.md b/test/antithesis/scratchbook/existing-assertions.md
new file mode 100644
index 00000000000..8a80c280b45
--- /dev/null
+++ b/test/antithesis/scratchbook/existing-assertions.md
@@ -0,0 +1,77 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: Datadog ADP Confluence space (design notes, weekly summaries, gap analyses) consulted for grounding.
+  - path: https://datadoghq.atlassian.net/browse/DADP
+    why: ADP Jira project for incidents and tracked gaps.
+  - path: https://github.com/DataDog/saluki/pull/1768
+    why: PR review #4393897611 — re-running research caught that the harness now adds SDK assertions this file previously denied.
+---
+
+# Existing Antithesis SDK Assertions
+
+## Summary
+
+**A small bootstrap-and-workload assertion set exists**, added by the Antithesis harness commit
+(`chore(agent-data-plane): Antithesis test harness and workload`, the parent of this scratchbook
+commit). It comprises **6 SDK call sites** across three binaries: one lifecycle init and one
+bootstrap reachability probe in ADP (both gated behind the `antithesis` cargo feature, no-op in
+production), plus two workload-side `assert_reachable!`/`assert_sometimes!` pairs in the harness test
+commands. These are **integration probes and anti-vacuity anchors**, not the property-catalog
+invariants — none of the 35 cataloged property assertions is implemented yet.
+
+> [!NOTE]
+> A prior version of this file stated no SDK assertions existed. That was true before the harness
+> commit landed; it is now stale. Re-research on 2026-05-30 corrected it.
+
+## Assertions present
+
+| File:line | Type | Message | Gating | Purpose |
+|-----------|------|---------|--------|---------|
+| `bin/agent-data-plane/src/main.rs:51` | `antithesis_init()` | (lifecycle init) | `#[cfg(feature = "antithesis")]` | Registers the assertion catalog before any are evaluated; no-op outside Antithesis, absent in prod builds. |
+| `bin/agent-data-plane/src/main.rs:100` | `assert_reachable!` | "agent-data-plane completed bootstrap" | `#[cfg(feature = "antithesis")]` | Bootstrap-integration probe — proves the SDK is linked, cataloging works, the instrumentation path is wired. |
+| `test/antithesis/harness/src/bin/finally_verify_delivery.rs:54` | `assert_reachable!` | "intake metrics dump query succeeded" | harness binary | Confirms the delivery-verification query path ran. |
+| `test/antithesis/harness/src/bin/finally_verify_delivery.rs:59` | `assert_sometimes!` | "metrics delivered end-to-end to the intake" (`delivered > 0`) | harness binary | Workload-side liveness anchor — partially seeds `forwarder-eventual-delivery`. |
+| `test/antithesis/harness/src/bin/parallel_driver_send_dogstatsd.rs:77` | `assert_reachable!` | "workload sent a dogstatsd batch" | harness binary | Confirms the DSD driver actually emitted load. |
+| `test/antithesis/harness/src/bin/parallel_driver_send_dogstatsd.rs:87` | `assert_sometimes!` | "workload drove a high-cardinality dogstatsd flood" (`regime == High`) | harness binary | Anti-vacuity anchor that timelines reach the high-cardinality regime — seeds `rss-bounded-under-cardinality`. |
+
+Dependency wiring: ADP gains the SDK only under the `antithesis` feature
+(`bin/agent-data-plane/Cargo.toml:14` → `dep:antithesis_sdk`, `antithesis_sdk/full`,
+`dep:antithesis-instrumentation`); the harness crate depends on `antithesis_sdk` unconditionally
+(`test/antithesis/harness/Cargo.toml`). `antithesis-instrumentation` is an external build-time
+instrumentation crate, not a source of in-tree assertions.
+
+## How this was determined
+
+Searched the repository with ripgrep over `*.rs` and `*.toml`:
+
+- `rg -li "antithesis" -g '*.rs' -g '*.toml'` — matches in ADP `main.rs`, the two harness binaries,
+  and the `Cargo.toml` files above.
+- `rg "assert_always|assert_sometimes|assert_reachable|assert_unreachable|antithesis_sdk" -g '*.rs'`
+  — the 6 call sites tabled above; **no `assert_always!` and no `assert_unreachable!` anywhere yet.**
+
+## Implication for property work
+
+The catalog's invariants are still **net-new instrumentation**. The two `assert_sometimes!` anchors
+above are workload-side only and serve anti-vacuity, not the safety/liveness invariants themselves:
+
+- `forwarder-eventual-delivery` has a workload-side `Sometimes(delivered > 0)` but no SUT-side
+  no-loss `Always`/accounting assertion — that remains to be added.
+- `rss-bounded-under-cardinality` has its high-cardinality `Sometimes` anchor but no SUT-side RSS or
+  interner-bound `Always` — also net-new.
+- The ~17 properties requiring in-process SUT-side assertions (per evaluation R2) still need ADP to
+  be forked and instrumented behind the `antithesis` feature. The feature scaffold now exists, which
+  lowers that bar — the `antithesis_init()` + bootstrap probe prove the wiring works end-to-end.
+
+Other existing (non-Antithesis) signals remain available to anchor assertions or workload-side
+checkers:
+
+- **Internal telemetry counters** via the `metrics` crate (`events_discarded_total`,
+  `events_sent_total`, aggregate `context_limit` breach counters, forwarder queue-drop counters).
+- **Rust unit tests** with std `assert!`/`assert_eq!`, dense across `saluki-components`,
+  `saluki-core`, `saluki-io`, and `ddsketch` — not Antithesis assertions; run only under `cargo test`.
+- A `loom` cfg in `lib/stringtheory/src/interning/` — the authors already treat the interner's
+  reclamation path as concurrency-sensitive.
diff --git a/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md b/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md
new file mode 100644
index 00000000000..24e65de132e
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md
@@ -0,0 +1,107 @@
+---
+slug: aggregate-clock-skew-stable
+title: Aggregation stays sane across wall-clock skew
+type: Safety
+priority: High
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+status: assertion MISSING; CONFIRMED two-clock hazard, no monotonicity guard
+---
+
+# aggregate-clock-skew-stable — Aggregation stays sane across wall-clock skew
+
+## Property (one sentence)
+A wall-clock jump (backward or forward) during aggregation produces neither a flood of
+zero-value counter points nor a silent gap in counter continuity, and bucketing stays
+bounded and well-formed.
+
+## Origin
+- SUT analysis §7 #9 (Wildcard): "Two-clock hazard: bucketing uses wall clock
+  (`get_unix_timestamp`), flush cadence uses monotonic `tokio::interval`. A backward
+  wall-clock jump makes the zero-value range empty (silent counter gap); a forward jump
+  floods zero-value points and allocates a large `SmallVec`. No monotonicity guard."
+
+## Files / functions / lines (CONFIRMED)
+- `lib/saluki-common/src/time.rs`
+  - `get_unix_timestamp` (21–26): `SystemTime::now().duration_since(UNIX_EPOCH).unwrap_or_default().as_secs()`
+    — **wall clock**, non-monotonic; on a backward step it returns a smaller value (and on
+    pre-epoch it `unwrap_or_default()`s to 0).
+- `lib/saluki-components/src/transforms/aggregate/mod.rs`
+  - Flush cadence: `interval_at(Instant::now() + flush_interval, flush_interval)` (290–293) —
+    **monotonic** tokio timer. So *when* a flush fires is monotonic, but *what timestamp* it
+    stamps is wall-clock.
+  - `insert` reads `current_time = get_unix_timestamp()` (347) per input batch; buckets via
+    `align_to_bucket_start(current_time, bucket_width_secs)` (579).
+  - `flush(get_unix_timestamp(), ...)` (319) — flush timestamp is wall clock.
+  - Zero-value bucket generation (627–635):
+    ```
+    let start = align_to_bucket_start(self.last_flush, bucket_width_secs);
+    for bucket_start in (start..current_time).step_by(bucket_width_secs as usize) { ... }
+    ```
+    `self.last_flush` is the wall-clock time of the previous flush (set at 718).
+    - **Backward jump:** `current_time < start` → range `start..current_time` is **empty** →
+      no zero-value buckets emitted for the gap → silent break in counter continuity (and the
+      `should_expire_if_empty` math `am.last_seen + counter_expire_secs < current_time` (651)
+      can flip, prematurely expiring or never expiring counters).
+    - **Forward jump:** `current_time >> start` → the loop emits one zero-value bucket per
+      `bucket_width_secs` across the entire jump, each pushed into
+      `SmallVec<[(u64, MetricValues); 4]>` (626) → large heap allocation + a flood of
+      zero-value points merged into every counter (661–671) and flushed downstream.
+  - `split_timestamp = align_to_bucket_start(current_time, w).saturating_sub(1)` (620) — a
+    backward jump moves the split earlier, so values already flushed in earlier (now "future")
+    buckets are retained, possibly re-evaluated against a smaller `current_time`.
+  - No comparison/guard between `current_time` and `last_flush` for monotonicity anywhere in
+    `flush` or `insert`.
+
+## Failure scenario (Antithesis angle — clock fault injection)
+1. **Forward jump (e.g. NTP step +1h, width 10s):** next flush generates ~360 zero-value
+   buckets, allocating a large `SmallVec` and emitting hundreds of zero-value rate points per
+   live counter downstream — a metric flood and memory spike (tension with bounded-memory and
+   "match the Agent" — the Agent does not behave this way).
+2. **Backward jump:** the zero-value range goes empty; counters that should have emitted
+   continuity zeros emit nothing for the skipped interval → downstream sees a gap. On a large
+   backward jump, `am.last_seen + counter_expire_secs < current_time` can become false for
+   counters that should expire (they linger, consuming context budget) or true for ones that
+   shouldn't.
+3. **Pre-epoch / clock reset to 0:** `unwrap_or_default()` yields `current_time = 0`, making
+   `align_to_bucket_start(0, w) = 0` and most ranges empty — effectively freezes bucketing.
+4. **Replay divergence (noted in §7 #9):** because non-timestamped metrics are bucketed by the
+   aggregator's *current* wall clock (not per-record capture time), a replayed capture buckets
+   differently than at capture — relevant to the replay feature.
+
+## Observations
+- Antithesis can drive this directly via **clock fault injection** (step the container clock
+  backward/forward) while a steady counter stream flows.
+- Natural bounded invariant: the number of zero-value buckets generated in a single flush must
+  be bounded by a small multiple of `flush_interval / window_duration` (about two at the default
+  15s / 10s), NOT by an arbitrary wall-clock delta. `zero_value_buckets.len()` is the in-process anchor.
+- Counter-continuity invariant: across a flush, a live counter's emitted timestamps should be
+  contiguous in bucket-width steps with no missing closed bucket and no duplicate.
+- SUT-side instrumentation wins: `zero_value_buckets.len()` and the `last_flush`/`current_time`
+  pair live inside `flush`; workload-side can only observe the downstream flood/gap indirectly.
+
+## Config dependencies
+- `aggregate_window_duration` (bucket width) and `aggregate_flush_interval` (cadence) set the
+  expected per-flush zero-value bucket count.
+- `counter_expiry_seconds` interacts with the skewed `last_seen` expiry comparison (651, 662).
+
+## Suggested assertion
+- `assert_always(zero_value_buckets.len() <= max_expected, "zero-value bucket count bounded")`
+  inside `flush`, where `max_expected` ≈ `ceil(flush_interval / window_duration) + small_slack`.
+  Flags the forward-jump flood.
+- `assert_always(current_time >= self.last_flush, "aggregate flush time is monotonic")` to flag the
+  backward jump; or, if a monotonicity guard is added (clamping `current_time` to `>= last_flush`),
+  an `AlwaysOrUnreachable` on the clamped value so the backward branch is shown to be handled.
+- `Sometimes(clock_jumped_during_flush)` to confirm the skew fault actually coincided with a flush.
+
+## Open questions
+- Intended fix: switch bucketing to a monotonic source, or guard
+  `current_time = max(current_time, last_flush)` and cap the zero-value loop iterations? This
+  decides whether the assertion is `Unreachable(flood)` vs `Always(bounded)`.
+- Is there an upstream protection against `get_unix_timestamp()` returning 0 (pre-epoch)?
+  None found — worth confirming the container clock can't be stepped below epoch in the harness.
+- What is the Agent's behavior under the same clock step? Needed to know whether "bounded and
+  no flood" also means "still matches Agent" (ties into `aggregate-matches-agent`).
+- Does the coarse-time path (`get_coarse_unix_timestamp`, 41–59) feed any aggregate code? (It
+  does not appear to; aggregate uses the accurate `get_unix_timestamp`.) Confirm no second path.
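Illustration for the `aggregate-clock-skew-stable` notes above (a standalone sketch, not a file in this diff). It reproduces the quoted zero-value range arithmetic with plain `u64` timestamps, then shows one possible guard. The function names, the `max_buckets` parameter, and the returned skew flag are hypothetical; the real `flush` works on `AggregationState` and a `SmallVec`, and `bucket_width_secs` is assumed non-zero here.

```rust
/// Aligns `timestamp` down to the start of its bucket (formula quoted in the notes).
/// Assumes `bucket_width_secs > 0`.
fn align_to_bucket_start(timestamp: u64, bucket_width_secs: u64) -> u64 {
    timestamp - (timestamp % bucket_width_secs)
}

/// Unguarded bucket starts, computed the way the quoted loop iterates them.
/// A backward jump yields an empty range; a forward jump yields one entry
/// per bucket width across the whole jump.
fn zero_value_bucket_starts(last_flush: u64, current_time: u64, bucket_width_secs: u64) -> Vec<u64> {
    let start = align_to_bucket_start(last_flush, bucket_width_secs);
    (start..current_time)
        .step_by(bucket_width_secs as usize)
        .collect()
}

/// Hypothetical guarded variant: clamps a backward jump to `last_flush` and
/// caps the number of generated buckets at `max_buckets`.
/// Returns the bucket starts plus a flag that is true when skew was detected
/// (clock went backward, or the cap truncated the range).
fn guarded_zero_value_bucket_starts(
    last_flush: u64,
    current_time: u64,
    bucket_width_secs: u64,
    max_buckets: usize,
) -> (Vec<u64>, bool) {
    let went_backward = current_time < last_flush;
    let clamped_now = current_time.max(last_flush);
    let start = align_to_bucket_start(last_flush, bucket_width_secs);

    // Take one extra element so truncation can be detected without
    // materializing the whole (possibly huge) range.
    let mut starts: Vec<u64> = (start..clamped_now)
        .step_by(bucket_width_secs as usize)
        .take(max_buckets.saturating_add(1))
        .collect();
    let overflowed = starts.len() > max_buckets;
    starts.truncate(max_buckets);

    (starts, went_backward || overflowed)
}
```

With `last_flush = 1000`, a 10s width, and the clock stepped to `4600` (one hour ahead), the unguarded function returns 360 starts, matching the "~360" figure in the failure scenario; the guarded variant with `max_buckets = 4` returns four starts and sets the skew flag. Whether clamping and capping is the intended fix is still an open question in the notes.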
diff --git a/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md b/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md
new file mode 100644
index 00000000000..d6bfbad1a91
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md
@@ -0,0 +1,121 @@
+# aggregate-context-limit-enforced
+
+**Family:** Resource Boundaries — bounded state / queues
+**Status:** Verified against code at commit 042f41db3b. Property is expected to **HOLD** (this is
+a real, enforced invariant) — it is the load-bearing memory-determinism lever for the aggregator.
+
+## What led to the property
+
+`sut-analysis.md` §3 and §7 identify the aggregation state map as "the central memory-determinism
+lever." Unlike the interner (which spills to heap by default) and the advisory memory limiter
+(off by default), the aggregate context limit is a **hard, always-on cap enforced at insert
+time**, independent of any `memory_mode`. It is the one runtime memory bound that is not
+advisory. Worth asserting precisely because it is the strongest claim ADP actually makes about
+bounded aggregation state.
+
+## The invariant and where it lives
+
+Aggregation state is a single `HashMap<Context, AggregatedMetric>` (`hashbrown`) owned solely by
+the transform task — no locks, all mutation `&mut self` (`transforms/aggregate/mod.rs`,
+`AggregationState`). The cap:
+
+- Default `aggregate_context_limit = 1_000_000` (`mod.rs:47-49` `default_context_limit`, field at
+  `mod.rs:114-115`, stored in `AggregationState.context_limit` at `mod.rs:531`).
+- Enforced in `AggregationState::insert` (`mod.rs:566-571`):
+  ```rust
+  if !self.contexts.contains_key(metric.context()) && self.contexts.len() >= self.context_limit {
+      self.context_limit_breached = true;
+      return false;   // new context over the cap is DROPPED
+  }
+  ```
+  Critically the guard is gated on `!contains_key`: an **existing** context always proceeds to
+  merge (lines 573+), so the cap only ever rejects *new* contexts, never breaks aggregation of
+  already-tracked ones.
+- The caller (`mod.rs:375-384`) treats `insert == false` as a drop: increments
+  `events_dropped` telemetry and logs **one** warning per breach episode (gated on
+  `was_breached` so it doesn't spam).
+- Recovery: `context_limit_breached` is cleared once `contexts.len() < context_limit` again
+  (`mod.rs:714-715`), e.g. after a flush evicts contexts.
+
+So the precise invariant: **live context count never exceeds `aggregate_context_limit`; over-cap
+*new* contexts are dropped-and-counted; *existing* contexts always merge.**
+
+## Failure scenario (Antithesis)
+
+Flood DSD with far more than `aggregate_context_limit` distinct contexts (set the limit low,
+e.g. 1000, to make the boundary reachable within a run). Assert the map size never exceeds the
+cap and that drops are counted. Antithesis adds value over the deterministic correctness harness
+by interleaving the flood with **flush timing** and **counter zero-value keep-alive**: zero-value
+counters kept alive after flush still count against the limit (`sut-analysis.md` §3), so the
+boundary can be hit by keep-alives, not just fresh contexts — a timing-sensitive interaction the
+fixed-clock harness won't explore. Also tests the recovery edge: does `len()` correctly dip below
+the cap after a flush and re-admit new contexts (clearing `context_limit_breached`)?
+
+## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist)
+
+- `Always(state.contexts.len() <= context_limit)` anchored in `insert`/after-insert in
+  `transforms/aggregate/mod.rs`. Safety: must hold on every check. Honest — there is no code path
+  that grows the map past the cap, so this is a true `Always`, not an aspirational one.
+- `AlwaysOrUnreachable(contains_key(ctx) || len < limit ⇒ insert succeeds)` — i.e. an existing
+  context is *never* dropped by the cap. Captures the "existing always merges" half. Use
+  AlwaysOrUnreachable because the merge-of-existing path may not be exercised in a given run.
+- `Sometimes(context_limit_breached == true)` — proves the workload actually reaches the boundary
+  (otherwise the `Always` above is vacuously true). Liveness/reachability of the interesting state.
+- `Sometimes(events_dropped incremented due to context limit)` — proves the drop is counted, not
+  silent-and-uncounted.
+
+This is a strong candidate for a true SUT-side `Always` because the bound is a local, lock-free,
+single-owner invariant — exactly the kind Antithesis `Always` is designed for.
+
+## Configuration dependencies
+
+- `aggregate_context_limit` (default 1,000,000). For a finite-duration run, must be lowered so
+  the boundary is reachable.
+- `counter_expiry_seconds` (default 300): kept-alive zero-value counters occupy context slots
+  until expiry, affecting how easily the cap is reached and when `len()` dips below it.
+- `aggregate_window_duration` (default 10s) / `aggregate_flush_interval` (default 15s): flush
+  cadence drives when contexts are evicted and the breach flag clears.
+
+## Open questions
+
+- The cap counts *contexts*, not *bytes*. A single context with many distinct timestamped values
+  is one map entry but unbounded value memory (cross-ref `rss-bounded-under-cardinality`). So this
+  property bounds entry count, NOT aggregator memory. Prose must not overclaim "bounded memory."
+
+## Investigation Log
+
+#### Zero-value keep-alive counters: storage location and flush-time `contexts.len()` behavior
+- **Examined**: `lib/saluki-components/src/transforms/aggregate/mod.rs`:
+  `AggregatedMetric` struct (522-525), `AggregationState` (529-558), `insert` (566-610),
+  `flush` (612-719), and the dedicated test `context_limit_with_zero_value_counters`
+  (1104-1157). Also the module doc on zero-value counters (71-75) and `is_empty` (562-564).
+- **Found (a) — storage**: There is **NO separate structure** for zero-value/keep-alive
+  counters. An idle counter remains as a normal entry in the single
+  `contexts: HashMap<Context, AggregatedMetric>` map (529). On flush, closed-bucket values are
+  split off and emitted (682-695) leaving `am.values` empty; the entry is only removed if
+  `am.values.is_empty() && should_expire_if_empty` (697). For counters,
+  `should_expire_if_empty` is **false** until `last_seen + counter_expire_secs < current_time`
+  (649-654), so a kept-alive counter is an empty-valued entry that **stays in `contexts`** and
+  on each subsequent flush gets a fresh `0.0` bucket merged in (661-672) and re-emitted.
+- **Found (a) — cap check counts them**: `insert` rejects a new context when
+  `!contexts.contains_key(..) && contexts.len() >= context_limit` (568). Since idle counters
+  are live entries in `contexts`, they **count toward `context_limit`** at the cap check. The
+  test at 1104-1157 asserts exactly this: with `context_limit = 2`, two counters that have gone
+  to zero-value mode still block insertion of a third (`assert!(!state.insert(... metric3 ...))`,
+  1138), and the third only succeeds (1152) after the two expire and are dropped (1148).
+- **Found (b) — when `len()` drops**: `contexts.len()` drops **during flush**, in the removal
+  pass at 703-707 (`for context in self.contexts_remove_buf.drain(..) { self.contexts.remove(...) }`),
+  only for entries that were marked at 697-700 (empty values AND eligible to expire). The breach
+  flag is cleared right after if `contexts.len() < context_limit` (714-716). So flush DOES remove
+  expired/all-closed non-counter contexts and expired counters; it does NOT remove kept-alive
+  (not-yet-expired, empty) counters. The recovery edge (re-admitting new contexts) is reachable
+  only after kept-alive counters actually expire (`counter_expire_secs`, default 300s) or a
+  non-counter context flushes empty.
+- **Not found**: No code path tracks zero-value counters outside `contexts`; no separate counter
+  toward the limit.
+- **Conclusion**: RESOLVED. The true live bound is exactly **`context_limit`** (map entries),
+  NOT `context_limit + zero_value_count` — kept-alive zero-value counters are ordinary map
+  entries already included in the `len()` cap check. The `Always(contexts.len() <= context_limit)`
+  assertion target is correct as-is. Caveat (already noted): this bounds entry *count*, not bytes;
+  with `counter_expiry_seconds` at its default of 300s, a flood of sparse counters keeps `len()`
+  pinned near the cap for ~5 min, which delays recovery but does not breach the bound.
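Illustration for the `aggregate-context-limit-enforced` notes above (a standalone sketch, not a file in this diff). It models the cap check with a `HashMap<String, f64>` in place of the real `HashMap<Context, AggregatedMetric>`. `CappedState` and `expire` are hypothetical names, and the drop counter is folded into `insert` for brevity, whereas the notes say the caller increments `events_dropped`.

```rust
use std::collections::HashMap;

/// Toy stand-in for the aggregation state: one map, one hard cap.
struct CappedState {
    contexts: HashMap<String, f64>,
    context_limit: usize,
    context_limit_breached: bool,
    events_dropped: u64,
}

impl CappedState {
    fn new(context_limit: usize) -> Self {
        Self {
            contexts: HashMap::new(),
            context_limit,
            context_limit_breached: false,
            events_dropped: 0,
        }
    }

    /// Returns false when a NEW context would exceed the cap.
    /// An existing context always merges, regardless of the cap.
    fn insert(&mut self, context: &str, value: f64) -> bool {
        if !self.contexts.contains_key(context) && self.contexts.len() >= self.context_limit {
            self.context_limit_breached = true;
            self.events_dropped += 1; // done by the caller in the real code
            return false;
        }
        *self.contexts.entry(context.to_string()).or_insert(0.0) += value;
        // The invariant the suggested `Always` assertion would check.
        debug_assert!(self.contexts.len() <= self.context_limit);
        true
    }

    /// Models the flush-time removal pass for a single expired context,
    /// followed by the breach-flag reset.
    fn expire(&mut self, context: &str) {
        self.contexts.remove(context);
        if self.contexts.len() < self.context_limit {
            self.context_limit_breached = false;
        }
    }
}
```

With `context_limit = 2`, inserting `a` and `b` succeeds, `c` is rejected and counted, a second insert of `a` still merges, and `c` is admitted only after `b` is expired. That is the same shape as the `context_limit_with_zero_value_counters` test described in the investigation log.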
diff --git a/test/antithesis/scratchbook/properties/aggregate-matches-agent.md b/test/antithesis/scratchbook/properties/aggregate-matches-agent.md
new file mode 100644
index 00000000000..69ad7766c75
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/aggregate-matches-agent.md
@@ -0,0 +1,96 @@
+---
+slug: aggregate-matches-agent
+title: Aggregated output matches the Datadog Agent
+type: Safety
+priority: High
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+status: assertion MISSING (no Antithesis SDK in tree)
+---
+
+# aggregate-matches-agent — Aggregated output matches the Datadog Agent
+
+## Property (one sentence)
+For the same input metric stream, ADP's aggregated output (counter→rate conversion,
+half-open `[start, start+width)` buckets, histogram/distribution statistics) equals the
+Datadog Agent's output — and that equivalence is preserved under fault injection
+(delayed/skipped flush, restart, backpressure, clock perturbation).
+
+## Origin
+- SUT analysis §5 safety #3 ("Aggregation output matches the Datadog Agent … explicitly
+  'to match the Datadog Agent'").
+- Existing correctness suite is a **diff test** (`bin/correctness/`) that already checks
+  happy-path equivalence deterministically. The Antithesis angle is preserving equivalence
+  under faults the diff harness cannot inject (§6 gaps 1, 5).
+
+## Files / functions / lines
+- `lib/saluki-components/src/transforms/aggregate/mod.rs`
+  - `counter_values_to_rate` (810–815): `MetricValues::Counter(points) => MetricValues::rate(points, interval)`.
+  - Passthrough conversion (451–459): counters→rate using **bucket width** as the rate
+    interval, with the in-code comment "we have to match the behavior of the Datadog Agent. ¯\_(ツ)_/¯".
+  - `transform_and_push_metric` (728–808): histogram → per-statistic metrics (count as rate
+    with bucket width, others as gauge); copy-to-distribution builds a `DDSketch` via
+    `insert_n` per sample (740–750); rate statistics use `MetricValues::rate(.., bucket_width)`.
+  - `is_bucket_closed` (821–843) + doc: half-open `[start, start+width)`, closed iff
+    `(bucket_start + width - 1) < current_time`.
+  - `align_to_bucket_start` (817–819): `timestamp - (timestamp % bucket_width_secs)`.
+  - `flush` split at `split_timestamp = align_to_bucket_start(current_time, w).saturating_sub(1)` (620).
+- Diff harness:
+  - `bin/correctness/stele/src/metrics.rs` `PartialEq for MetricValue` (153–186):
+    Count/Rate/Gauge compared with `approx_eq_ratio(RATIO_ERROR=1e-8)`; **Rate also requires
+    `interval_a == interval_b`** (171); Sketch compared on min/max/avg/sum (ratio) + exact
+    `count()` + exact `bin_count()` (175–182).
+  - `bin/correctness/panoramic` (drives identical workload into Agent + ADP), `millstone`
+    (load gen), `datadog-intake` (mock intake). Fixed `FLUSH_WAIT = 32s` (per SUT analysis §6).
+
+## Failure scenario (Antithesis angle)
+Diff equivalence is established only under a healthy, deterministic run. Faults that can
+break it without the existing suite ever noticing:
+1. **Delayed / skipped flush:** if the monotonic `primary_flush` interval is delayed (CPU
+   starvation, scheduler pause), a bucket that should have closed flushes one interval late.
+   Combined with the wall-clock bucketing (see `aggregate-clock-skew-stable`), the emitted
+   timestamps/rate intervals can diverge from the Agent.
+2. **Restart mid-window:** `flush_open_windows=false` default drops open buckets on shutdown
+   (SUT analysis §3). The Agent baseline and ADP may shed different partial windows on a kill,
+   producing a one-window data delta.
+3. **Backpressure:** a slow downstream blocks the aggregate's dispatcher (`dispatcher.flush().await`,
+   330); if input continues to arrive during the stall, late-arriving updates may land in a
+   different wall-clock bucket than the Agent assigns them.
+4. **Counter→rate interval:** the `interval` carried on a rate is the **bucket width**, not the
+   flush interval. If window vs flush interval are misconfigured relative to the Agent, the
+   `interval_a == interval_b` check (stele 171) fails even when values match.
+
+## Observations
+- This is fundamentally a **differential** property: it requires running both ADP and a
+  Datadog Agent baseline against an identical stream and diffing normalized output. It is not
+  expressible as a single in-process SDK assertion the way the others in this catalog are.
+- Best realized in Antithesis by extending the existing `panoramic` diff harness to run
+  *inside* the Antithesis environment and assert equivalence (`assert_always` the diff is
+  empty / within ratio) **while Antithesis injects faults** (network, process kill+restart,
+  clock). The diff result is the natural assertion anchor.
+- OTLP metrics deliberately **skip aggregation** (SUT analysis §2, `run.rs:751-753`) to avoid
+  counter→rate; equivalence claims apply to the DSD path.
+
+## Config dependencies
+- `aggregate_window_duration` (default 10s) — drives bucket width AND the rate interval.
+- `aggregate_flush_interval` (default 15s) — drives flush cadence (monotonic).
+- `aggregate_flush_open_windows` (default false) — governs restart-window deltas.
+- `counter_expiry_seconds` (default 300) — zero-value counter continuity.
+- `histogram_aggregates` / `histogram_copy_to_distribution[_prefix]` — histogram output shape.
+- Agent baseline must be configured with matching window/flush/expiry for a fair diff.
+
+## Suggested assertion
+- Workload-side (harness): `assert_always(diff_within_ratio, "ADP aggregation matches Agent")`
+  evaluated on the normalized stele diff after each flush window, **with faults active**.
+- `Sometimes(fault_was_injected_during_window)` to confirm the equivalence check actually ran
+  under a perturbed condition (not just clean windows).
+
+## Open questions
+- Does the existing `panoramic` harness tolerate a process restart of ADP mid-run, or does it
+  assume a single long-lived process? (Determines how restart-equivalence is asserted.)
+- What FLUSH_WAIT is needed once faults delay flushes? The fixed 32s may be too short under
+  injected scheduler pauses, causing false diffs that are timing artifacts not correctness bugs.
+- Is the Agent baseline's bucket width guaranteed identical to ADP's `aggregate_window_duration`
+  in the harness config? If not, the `interval` equality check is a false-positive source.
+- Are zero-value counters (continuity) emitted identically by both sides across a skipped flush?
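Illustration for the `aggregate-matches-agent` notes above (a standalone sketch, not a file in this diff). It shows why a rate can fail the diff on its interval alone, and how the half-open bucket rule decides closure. The `MetricValue` enum is a simplified stand-in with only two variants, and the body of `approx_eq_ratio` is an assumption (relative difference against the larger magnitude); the notes give only the function name and `RATIO_ERROR = 1e-8`.

```rust
/// Simplified stand-in for the harness's metric value type.
#[derive(Debug, Clone, Copy)]
enum MetricValue {
    Gauge(f64),
    Rate { value: f64, interval_secs: u64 },
}

const RATIO_ERROR: f64 = 1e-8;

/// Assumed definition: relative difference, scaled by the larger magnitude.
fn approx_eq_ratio(a: f64, b: f64, max_ratio: f64) -> bool {
    if a == b {
        return true; // also covers the 0 == 0 case, avoiding division by zero
    }
    let scale = a.abs().max(b.abs());
    (a - b).abs() / scale <= max_ratio
}

/// Rates must agree on the interval exactly AND on the value within ratio.
fn values_match(a: MetricValue, b: MetricValue) -> bool {
    match (a, b) {
        (MetricValue::Gauge(x), MetricValue::Gauge(y)) => approx_eq_ratio(x, y, RATIO_ERROR),
        (
            MetricValue::Rate { value: x, interval_secs: ia },
            MetricValue::Rate { value: y, interval_secs: ib },
        ) => ia == ib && approx_eq_ratio(x, y, RATIO_ERROR),
        _ => false, // different kinds never match
    }
}

/// Half-open bucket `[start, start + width)`: closed once its last second
/// is strictly before `current_time`. Assumes `bucket_width_secs > 0`.
fn is_bucket_closed(current_time: u64, bucket_start: u64, bucket_width_secs: u64) -> bool {
    (bucket_start + bucket_width_secs - 1) < current_time
}
```

Two rates with identical values but intervals of 10 and 15 do not match, which is the false-positive source raised in the open questions. A bucket starting at 1000 with width 10 is still open at `current_time = 1009` and closed at 1010.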
diff --git a/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md b/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md
new file mode 100644
index 00000000000..ab6f81d8c04
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md
@@ -0,0 +1,98 @@
+---
+slug: aggregate-no-panic-any-window
+title: No aggregate_window_duration value causes a panic
+type: Safety / Reachability
+priority: Low
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+status: FIXED UPSTREAM (window now NonZeroU64, PR #1772); retained as regression tripwire
+---
+
+# aggregate-no-panic-any-window — No `aggregate_window_duration` value causes a panic
+
+> **Update (2026-05-30): FIXED UPSTREAM on main.** The original `% 0` / `step_by(0)` panic vector
+> documented below is now structurally impossible. The config key was renamed
+> `aggregate_window_duration_seconds` and is typed `NonZeroU64` (`transforms/aggregate/mod.rs:95-98`);
+> `bucket_width_secs` is `NonZeroU64` end-to-end and `align_to_bucket_start` divides by
+> `bucket_width_secs.get()` (`:822-823`), which can never be zero. A zero/sub-second value now fails
+> config deserialization instead of reaching the divisor (PR #1772). The forensic detail below is
+> retained as the historical evidence trail for the original defect; the property survives only as a
+> low-cost `Unreachable` regression tripwire. The repro test
+> `tests::bug_sub_second_aggregate_window_panics_on_insert` (in a sibling stack commit) is now stale
+> and should be dropped or converted to a passing guard — see the bug ledger.
+
+## Property (one sentence)
+No configured `aggregate_window_duration_seconds` value causes the aggregate transform to panic;
+the bucket-width divisor is `NonZeroU64` and can never be zero, so zero/sub-second values are
+rejected at config load rather than reaching the modulo path.
+
+## Origin
+- SUT analysis §7 #8 (Wildcard): "Sub-second `aggregate_window_duration` → guaranteed panic:
+  `bucket_width_secs = window.as_secs()` with no validation; a value < 1s yields `% 0`
+  divide-by-zero and `step_by(0)` panics."
+
+## Files / functions / lines (CONFIRMED)
+- `lib/saluki-components/src/transforms/aggregate/mod.rs`
+  - `AggregationState::new` (542–560): `bucket_width_secs: bucket_width.as_secs()` (553).
+    For any `window_duration < 1s`, `as_secs()` truncates to **0**. No validation.
+  - `align_to_bucket_start` (817–819): `timestamp - (timestamp % bucket_width_secs)` →
+    **`% 0` panics** (`attempt to calculate the remainder with a divisor of zero`).
+    Called from `insert` (579) on every metric and from `flush` (620, 628).
+  - `flush` zero-value loop (630): `(start..current_time).step_by(bucket_width_secs as usize)`
+    → **`step_by(0)` panics** (`step_by called with step == 0`). Reached on the 2nd+ flush
+    (`self.last_flush != 0`, 627).
+  - First reachable panic is in `insert` via `align_to_bucket_start` (579) — i.e. on the very
+    first metric, before any flush, if `bucket_width_secs == 0`.
+- Config plumbing (CONFIRMED no validation):
+  - `AggregateConfiguration.window_duration` (92–93) deserialized via serde with
+    `default = default_window_duration` (10s); no `#[validate]` / range check.
+  - `from_configuration` (187–189): `config.as_typed()` — pure deserialize, no bounds check.
+  - `config_registry/datadog/aggregate.rs` (7–13, 50–58): `aggregate_window_duration` declared
+    as `ValueType::String`, `SupportLevel::Full`, `default: None`, `test_json {"secs":42}`.
+    No minimum/positive-value constraint anywhere in the registry.
+  - Repo-wide grep for `aggregate_window_duration` / `window_duration` shows zero validation
+    sites (only definition, default, two uses, and a telemetry nanos field at 1534).
+
+## Failure scenario
+Operator sets `aggregate_window_duration: 500ms` (or any value `< 1s`, e.g. `{"secs":0,"nanos":...}`).
+- `bucket_width_secs = 0`.
+- First DSD metric reaching `dsd_agg` calls `insert` → `align_to_bucket_start(ts, 0)` → `% 0`
+  panic → aggregate task panics → data topology component finishes unexpectedly →
+  `wait_for_unexpected_finish` → **whole-process shutdown** (SUT analysis §2 supervision; data
+  components are fail-stop, not restarted). s6 restarts ADP, which re-reads the same bad config
+  and panics again → crash loop. This directly violates the "won't crash" guarantee.
+- Note: a `1500ms` window does NOT panic (`as_secs() == 1`) but silently truncates the window
+  to 1s — a separate correctness surprise worth a `Sometimes` observation.
+
+## Observations
+- The panic is deterministic given the config; Antithesis value is exercising the config space
+  (including the `{"secs":0,"nanos":N}` Duration shape the registry advertises) and catching the
+  crash, OR validating the planned fix.
+- `PassthroughBatcher` also receives `window_duration` as `bucket_width` (220–224) but only uses
+  it as a `Duration` for the rate interval (`counter_values_to_rate`), not as a divisor — so the
+  passthrough path does not panic on sub-second windows; only the aggregation state path does.
+
+## Config dependencies
+- `aggregate_window_duration` (the sole trigger). Truncation via `as_secs()` means any value in
+  `[0, 1s)` → 0 → panic; `[1s, ...)` → floor to whole seconds (lossy but safe).
+
+## Suggested assertion
+- If the fix is **validation at config load**: `assert_always(window_secs >= 1, ...)` at the
+  point `AggregationState` is constructed (or `AlwaysOrUnreachable` that the build path rejects
+  sub-second windows), plus `Unreachable` on the `% 0` / `step_by(0)` code path.
+- If no fix: `assert_unreachable("aggregate align_to_bucket_start reached with bucket_width_secs == 0")`
+  placed in `align_to_bucket_start` / before the `step_by` loop — Antithesis flags it the moment a
+  sub-second window is fed.
+- `Reachable("aggregate constructed with sub-second window")` to confirm the workload actually
+  explores that config region.
+
+## Open questions
+- Should the fix clamp (`max(1)`), reject at load (fatal config error, consistent with §5 #5
+  "config incompatibility is fatal at startup"), or support genuine sub-second bucketing
+  (would require changing `bucket_width_secs` from `u64` seconds to a finer unit)? This changes
+  whether the assertion is `Always(validated)` vs `Unreachable(panic path)`.
+- Does the gRPC dynamic-config stream allow pushing `aggregate_window_duration` at runtime? If
+  so, a mid-run config update to a sub-second window is a live crash vector, not just a startup one.
+- SUT-side instrumentation needed: the divisor lives deep in `align_to_bucket_start`; an
+  `Unreachable` assertion there is the cleanest signal (workload-side cannot observe the divisor).
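Illustration for the `aggregate-no-panic-any-window` notes above (a standalone sketch, not a file in this diff). It contrasts the original `as_secs()` truncation with a `NonZeroU64` divisor. `validated_bucket_width` is a hypothetical helper that validates a `Duration`; per the update note, the upstream fix instead types the config key itself as `NonZeroU64` seconds, so this only demonstrates the idea.

```rust
use std::num::NonZeroU64;
use std::time::Duration;

/// The original behavior: sub-second windows truncate to 0 seconds.
fn truncated_bucket_width_secs(window: Duration) -> u64 {
    window.as_secs()
}

/// Hypothetical load-time validation: reject any window that truncates to 0.
fn validated_bucket_width(window: Duration) -> Result<NonZeroU64, String> {
    NonZeroU64::new(window.as_secs())
        .ok_or_else(|| format!("window {:?} truncates to zero whole seconds", window))
}

/// With a `NonZeroU64` divisor the `% 0` panic is ruled out by the type.
fn align_to_bucket_start(timestamp: u64, bucket_width_secs: NonZeroU64) -> u64 {
    timestamp - (timestamp % bucket_width_secs.get())
}
```

A 500ms window truncates to 0 and is rejected by the validator, while a 1500ms window is accepted but silently becomes 1s, which is the separate truncation surprise called out in the failure scenario.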
diff --git a/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md b/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md
new file mode 100644
index 00000000000..a1d85421d28
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md
@@ -0,0 +1,152 @@
+---
+slug: config-incompatible-refuses-start
+title: High-severity incompatible config refuses to start the pipeline
+family: Lifecycle Transitions & Configuration
+type: Safety (Reachability / Unreachable)
+priority: Medium
+status: assertion-missing
+sut_commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+---
+
+# config-incompatible-refuses-start
+
+## Origin
+
+SUT analysis §5 Safety #5 ("Config incompatibility is fatal at startup (high-severity incompatible
+non-default key → refuse to run)") and §6 (integration suite has "config-check exit codes" cases).
+
+## Files / functions / lines
+
+- `bin/agent-data-plane/src/cli/run.rs:157`:
+  ```rust
+  check_and_warn_config(&config).error_context("Incompatible configuration detected.")?;
+  ```
+  Crucial ordering: this is at run.rs:157, **before** `create_topology` (run.rs:168),
+  `blueprint.build()` (run.rs:238), and `built_topology.spawn(...)` (run.rs:239). The `?` propagates
+  the error out of `handle_run_command` before any data component is ever built or spawned.
+- `bin/agent-data-plane/src/cli/run.rs:331-381` — `check_and_warn_config`:
+  - Iterates `config.flattened_keys()` (run.rs:336-339).
+  - `config_classifier.classify(&key, &val)` (run.rs:341) → `None` skips invalid/N-A keys.
+  - `classification.is_default` (run.rs:346) → keys with default values are **skipped** (the Agent
+    populates defaults, so only **non-default** values count).
+  - Match on `support_level` (run.rs:351-370):
+    - `Full` / `Incompatible(Low)` / `Partial` / `Incompatible(Medium)` → log + **Proceed**.
+    - `Incompatible(Severity::High)` (run.rs:362-366) → `error!` log +
+      `high_severity_incompatibilities += 1`.
+    - `Ignored` / `Unrecognized` → silently ignored.
+  - run.rs:373-378: if `high_severity_incompatibilities > 0`, returns
+    `Err(generic_error!("{n} incompatible configuration detected. ADP cannot start. …"))`.
+    **All keys are checked before returning** (so the count and all error logs are complete).
+- `bin/agent-data-plane/src/main.rs:136-146` — the `Err` from `handle_run_command` maps to
+  `Some(1)`; `main.rs:101-104` calls `std::process::exit(1)`. So a high-severity incompatibility →
+  **process exit code 1, pipeline never spawned.**
+- `lib/saluki-components/src/config_registry/classifier.rs:42-51` — `classify`: looks up the schema
+  entry, returns `Classification { support_level, is_default }`.
+- `lib/saluki-components/src/config_registry/classifier.rs:53-…` — `is_default_value`: compares the
+  value against the schema's documented default (incl. null/empty-string handling, alias handling).
+- `lib/saluki-components/src/config_registry/mod.rs:144-175` — `Severity { Low, Medium, High }` and
+  `SupportLevel::{Full, Partial, Incompatible(Severity), Ignored, Unrecognized}`. Incompatible keys
+  live in `config_registry/datadog/unsupported.rs` with their severity.
+
+## Key observation
+
+The refuse-to-start gate is correctly placed **before** topology build/spawn, so the safety
+property "pipeline never runs with a high-severity-incompatible non-default key" is structurally
+enforced by control flow, not by a runtime check inside the running pipeline. The two strongest
+Antithesis assertions:
+
+- **Unreachable:** "data pipeline spawned while a high-severity incompatible non-default config key
+  is present." Place an `assert_unreachable!` reading a flag (set during `check_and_warn_config`
+  when a high-severity incompatibility was seen) at/after `spawn()` (run.rs:239). Because the `?` at
+  run.rs:157 returns first, this point is never reached with such a key set — Antithesis confirms it.
+- **Reachable:** "ADP refused to start due to incompatible config" — mark the
+  `high_severity_incompatibilities > 0` return path (run.rs:373) as reachable so the workload knows
+  the refusal path is actually exercised under the incompatible-config workload.
+
+## Failure scenario (Antithesis angle)
+
+- Workload injects a config containing a known high-severity-incompatible key with a **non-default**
+  value (sourced from `config_registry/datadog/unsupported.rs` `Severity::High` entries). Expect:
+  process exits with code 1, no `topology_ready_ms` log, no listener bound, no data forwarded.
+- Negative control: same key but at its **default** value → `is_default` true → skipped → ADP
+  starts normally. Assert the refusal path is NOT taken (the gate must not be over-eager).
+- Multiple high-severity keys → still exits 1, error reports the count (run.rs:374-377). Verify all
+  are logged before exit (debuggability invariant).
+- Medium/Low/Partial incompatible keys → ADP proceeds (warn/debug only). Assert pipeline DOES start
+  — confirms severity gating is graded, not all-or-nothing.
+
+## Config dependencies
+
+- The exact set of `Severity::High` keys lives in `config_registry/datadog/unsupported.rs`
+  (generated/maintained list). The workload needs at least one current High-severity key+non-default
+  value to exercise the refusal path; this list can drift, so the workload should source the key
+  dynamically or be pinned to the commit.
+- Config arrives either from bootstrap YAML/env (run.rs:149 branch) or from the Core Agent dynamic
+  config (run.rs:107-145 branch). `check_and_warn_config` runs on the **final** resolved `config`
+  (run.rs:157) regardless of source, so an incompatible key delivered *over the config stream* is
+  also gated. (Note: on the dynamic path the gate runs once at startup after `ready()`; a key that
+  becomes incompatible via a later partial update is NOT re-checked — see Open Questions.)
+
+## Assertion (MISSING — net-new instrumentation)
+
+No Antithesis SDK assertions exist. Proposed SUT-side:
+- In `check_and_warn_config`, when `high_severity_incompatibilities > 0`, before returning Err:
+  `assert_reachable!("ADP refused to start: high-severity incompatible config")`.
+- Set a process-local flag `saw_high_severity_incompat = true` in that branch; add
+  `if saw_high_severity_incompat { assert_unreachable!("pipeline spawned with high-severity
+  incompatible config") }` immediately after `built_topology.spawn(...)` at run.rs:239. The guard
+  matters because `assert_unreachable!` takes no condition and fires whenever it is reached; the
+  branch should be dead since the `?` already returned, so this catches a future reordering.
+- Alternatively / additionally, workload-side: `process_exits_with(1)` + assert no
+  `topology_ready_ms` log + intake never receives data (mirrors existing integration "config-check
+  exit codes" cases but under fault injection).
+
+## SUT-side instrumentation needs
+
+- Antithesis SDK dependency (none today).
+- Reachable marker on the refusal branch; Unreachable marker after spawn keyed on a
+  high-severity-seen flag.
+- Workload must supply a current `Severity::High` key with a non-default value.
+
+### Investigation Log
+
+#### Are runtime partial config updates re-validated by the incompatibility gate? (2026-05-28)
+
+**Examined:**
+- `bin/agent-data-plane/src/cli/run.rs:157` (`check_and_warn_config` call site), `:331-381`
+  (`check_and_warn_config` body), `:14` (the only import of `ConfigClassifier`/`Severity`/
+  `SupportLevel`).
+- `lib/saluki-config/src/lib.rs:541-651` (`run_dynamic_config_updater` — the task that applies all
+  runtime `Snapshot`/`Partial` updates over the gRPC config stream).
+- Grep across `lib/saluki-config/` and `bin/agent-data-plane/src/` for `ConfigClassifier`,
+  `check_and_warn_config`, `classify(`.
+
+**Found — gate is startup-only, NOT re-run at runtime:**
+- `check_and_warn_config` constructs a fresh `ConfigClassifier::new()` (run.rs:333) and is invoked
+  exactly once, at run.rs:157, before `create_topology`/`build`/`spawn`. The `?` propagates the
+  `Err` before the pipeline is built (matches the existing "Key observation" section).
+- The runtime updater `run_dynamic_config_updater` rebuilds the figment on every update
+  (lib.rs:564-578 for the initial snapshot, lib.rs:621-649 for subsequent updates) and dispatches
+  `ConfigChangeEvent`s via `dynamic::diff_config` (lib.rs:633-639), but it contains **no reference
+  to `ConfigClassifier` or `check_and_warn_config`** and performs no support-level/severity check.
+  A `Partial` update is applied via `upsert(&mut dynamic_state, &key, value)` (lib.rs:612) and the
+  new figment is swapped in (lib.rs:645) — unconditionally.
+- The grep confirms `ConfigClassifier` and `check_and_warn_config` appear ONLY in run.rs (the import
+  at :14 and the call/definition at :157/:331). The saluki-config crate that owns the dynamic updater
+  has zero awareness of the classifier.
+
+**Not found:** No runtime re-validation hook, no severity check on `ConfigChangeEvent`, no path that
+re-enters `check_and_warn_config` after startup, and no mechanism that would refuse/halt on a
+runtime-delivered high-severity key. The classifier crate (`saluki-components::config_registry`) is
+not a dependency of the dynamic-update path.
+
+**Conclusion (RESOLVED, scope confirmed):** The incompatibility gate runs **only once at startup**.
+A config key that flips to a high-severity-incompatible value via a later `Partial` (or `Snapshot`)
+update over the gRPC stream is applied to the live figment and broadcast as a change event — ADP
+**keeps running**; it does NOT refuse to start or shut down. The `config-incompatible-refuses-start`
+property is therefore correctly scoped to **startup configuration only** (the `?` at run.rs:157
+guards topology spawn against the *startup-resolved* config, including the first snapshot + env
+overlays, per the existing notes). Runtime re-validation is a genuine GAP and warrants a separate
+property/finding ("runtime config update can introduce a high-severity-incompatible key with no
+re-gate") — not folded into this safety property. Recommend filing that gap; this file's Open
+Questions are now resolved.
diff --git a/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md b/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md
new file mode 100644
index 00000000000..e2e7f72ab12
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md
@@ -0,0 +1,60 @@
+# config-runtime-update-not-revalidated
+
+## Origin
+
+Surfaced during the open-question investigation of `config-incompatible-refuses-start`. ADP's
+incompatibility gate (`check_and_warn_config` + `ConfigClassifier`) protects startup, but the
+**dynamic config stream** from the Core Agent can deliver partial/snapshot updates at runtime that
+are never re-classified. A config key that becomes high-severity-incompatible *after* startup is
+applied silently.
+
+## Code paths
+
+- `bin/agent-data-plane/src/cli/run.rs:157` — `check_and_warn_config(&config)` runs exactly once,
+  before `create_topology` / `build` / `spawn`. Its `Err` aborts startup (`exit(1)`).
+- `ConfigClassifier` / `check_and_warn_config` are referenced **only** in `run.rs` — there is no
+  re-validation hook on the dynamic-config update path.
+- The dynamic config updater (`saluki-config` `lib.rs` ~541-651) applies runtime `Partial`/`Snapshot`
+  updates with no classifier check.
+- The config stream itself: `bin/agent-data-plane/src/internal/remote_agent.rs` (config event loop)
+  pushes `ConfigUpdate::Snapshot/Partial` into the dynamic configuration.
+
+## Failure scenario
+
+The Core Agent pushes a config update (e.g. enabling a feature ADP classifies as
+`Incompatible(Severity::High)`) over the AgentSecure config stream while ADP is running. ADP applies
+it and **keeps running** in a configuration it would have refused to start with — risking wrong
+aggregates or silent data corruption, exactly the outcome the startup gate exists to prevent.
+
+## Property
+
+- **Type:** Safety (Reachability / scope gap)
+- **Invariant:** Either `Unreachable("pipeline running with high-severity incompatible non-default
+  key after a runtime config update")`, or — if the intended design is "startup-only gating" — a
+  `Reachable` marker proving the unguarded runtime-apply path is taken, documenting the gap.
+- **Antithesis angle:** Start ADP with a clean config (passes the gate), then inject a config-stream
+  update carrying a high-severity-incompatible non-default key; observe whether ADP detects/refuses
+  or silently applies it. This exercises the control-plane → data-plane config path the diff-test
+  never touches.
+- **Priority:** Medium (depends on whether high-severity keys are reachable via the stream in
+  practice — a product question).
+
+## Open Questions
+
+- Is this an intentional design choice (startup-only gating, runtime updates trusted because they
+  come from the authoritative Core Agent) or an oversight? `(needs human input)`
+- Can a `Severity::High` key actually be delivered over the config stream, or does the Core Agent
+  pre-filter what it sends to remote agents? Determines real-world reachability.
+
+### Investigation Log
+
+#### Are runtime partial config updates re-validated by the incompatibility gate?
+
+- Examined: `bin/agent-data-plane/src/cli/run.rs:157,331-381` (gate + caller), grep for
+  `check_and_warn_config` / `ConfigClassifier` across the tree, `saluki-config` dynamic updater
+  (`lib.rs` ~541-651), `internal/remote_agent.rs` config event loop.
+- Found: the gate runs exactly once at startup; the classifier is referenced only in `run.rs`; the
+  runtime update path applies `Partial`/`Snapshot` updates with no classifier call.
+- Conclusion: confirmed — a runtime config update can introduce a high-severity-incompatible key
+  with no re-gate, and ADP keeps running. Filed as this standalone property. The remaining questions
+  (intentional vs. oversight; stream reachability of high-severity keys) need the team's input.
diff --git a/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md b/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md
new file mode 100644
index 00000000000..5d9963ec6a9
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md
@@ -0,0 +1,187 @@
+---
+slug: config-stall-no-deadlock
+title: Config-stream stall does not deadlock or busy-loop startup
+family: Lifecycle Transitions & Configuration
+type: Liveness
+priority: High
+status: assertion-missing
+sut_commit: 042f41db3bd97118c38981765fd49696fce9d318
+---
+
+# config-stall-no-deadlock
+
+## Origin
+
+SUT analysis §2 (control plane: "Startup **blocks** on `dynamic_config.ready().await` for the
+first config (run.rs:119-121, *no timeout shown*). Stream end → reconnect after fixed 5s") and §7
+#13 ("Core Agent reachability assumed at startup: ADP blocks indefinitely on
+`dynamic_config.ready()` with no visible timeout — if the Agent never sends config, ADP never
+starts the pipeline").
+
+## Files / functions / lines
+
+- `bin/agent-data-plane/src/cli/run.rs:119-121`:
+  ```rust
+  info!("Waiting for initial configuration from Datadog Agent...");
+  dynamic_config.ready().await;
+  info!("Initial configuration received.");
+  ```
+  This is on the `use_new_config_stream_endpoint` path (run.rs:107). It is reached only after
+  `RemoteAgentBootstrap::from_configuration` (run.rs:96-104) — which itself **blocks on the initial
+  registration** (`remote_agent.rs:97-105`, `init_reg_rx.await`).
+- `lib/saluki-config/src/lib.rs:687-704` — `GenericConfiguration::ready()`: awaits a `oneshot`
+  receiver (`ready_signal`). **No timeout.** If the oneshot never fires it awaits forever. If the
+  sender is dropped, `ready_rx.await` returns `Err` and `ready()` logs an error and **returns**
+  (so a dropped channel unblocks startup; an idle-but-open channel does not).
+- `lib/saluki-config/src/lib.rs:541-584` — `run_dynamic_config_updater`: the oneshot
+  `ready_sender.send(())` fires (lib.rs:581) **only after the first `ConfigUpdate::Snapshot`** is
+  received and the figment is rebuilt. If `receiver.recv()` returns `None` before any snapshot
+  (channel closed), the task returns *without* sending ready (lib.rs:546-552) → `ready()` sees the
+  sender dropped → returns with an error log (does NOT hang). If the channel stays open but no
+  snapshot ever arrives, the task awaits at lib.rs:546 forever and `ready()` hangs forever.
+- `bin/agent-data-plane/src/internal/remote_agent.rs:251-304` — `run_config_stream_event_loop`:
+  - `:260` waits for a session ID (`session_id.wait_for_update().await`).
+  - `:262-263` opens `stream_config_events` and drains it.
+  - On stream error (`:295-298`): logs `error!` and **continues the inner while-loop**; this would
+    spin only if the stream yielded steady errors without ending (ruled out in the Investigation Log).
+  - On stream end (outer loop falls through to `:301-302`): `debug!("Config stream ended, retrying
+    in 5 seconds…"); tokio::time::sleep(Duration::from_secs(5)).await;` — fixed 5s reconnect
+    backoff, then loops back to `:255`.
+- `remote_agent.rs:139-148` — `create_config_stream` builds the `mpsc::channel(100)` whose receiver
+  feeds `run_dynamic_config_updater`. The config-stream event loop is the **sender** side; it only
+  drops the sender by `return`ing (e.g. `:289-292` when the dynamic config channel is closed).
+
+## Investigation: IS there a timeout?
+
+**No.** Confirmed by reading `GenericConfiguration::ready()` (lib.rs:694-704): a bare
+`ready_rx.await` with no `tokio::time::timeout` wrapper, and `run.rs:120` calls it bare. There is
+also no timeout on `init_reg_rx.await` in `RemoteAgentBootstrap::from_configuration`
+(remote_agent.rs:97). The registration *retries* in the background (registration loop
+`remote_agent.rs:185-249`, `DEFAULT_REFRESH_INTERVAL=30s`, `REFRESH_FAILED_RETRY_INTERVAL=5s`), and
+the first registration result is forwarded to `init_reg_rx` (success or error) — so the **bootstrap
+registration** does resolve (Ok or Err) on the first attempt. But the subsequent **config-stream
+`ready()`** has no timeout: if the Core Agent registers ADP but never streams a snapshot, ADP hangs
+at run.rs:120 indefinitely, logging only the single "Waiting for initial configuration…" line.
+
+## Honest framing of the property
+
+This is a **liveness** property with two acceptable outcomes (not a crash, not a busy-loop):
+1. **Progress:** ADP eventually receives the first snapshot and logs "Initial configuration
+   received." → proceeds to build topology.
+2. **Quiescent waiting:** ADP remains observably blocked at "Waiting for initial configuration from
+   Datadog Agent…" — a *quiescent* await (parked on a oneshot / parked in `sleep`), **not** burning
+   CPU and **not** panicking.
+
+The property to assert is therefore: **the config stall never produces a crash, panic, or
+busy-loop; ADP is either making progress or quiescently waiting.** It is NOT correct to assert
+"ADP always eventually starts" — with no timeout and a permanently-silent Agent, it legitimately
+never starts. Document that the *absence of a timeout* is the design as-is (it matches the
+s6-supervised container model, where the operator/Agent presence is assumed).
+
+## Failure scenarios (Antithesis angle)
+
+- **Drop the config snapshot:** Core Agent registers ADP but the config stream never sends a
+  `Snapshot` (or sends only `Partial`). Expect: quiescent block at run.rs:120; CPU near zero; no
+  panic. Falsify on busy-loop (high CPU while "waiting").
+- **Flap the stream:** stream repeatedly opens then ends (EOF). Expect: reconnect every 5s
+  (remote_agent.rs:302), no tighter spin. Assert `Sometimes("reconnect-after-5s path taken")`.
+- **Steady stream error (no EOF):** stream yields `Err` items continuously
+  (remote_agent.rs:295-298 `continue`s the inner loop without backoff). Flagged as a potential
+  busy-loop / log-flood, but the Investigation Log finds it unreachable through tonic (low value).
+- **Close the dynamic-config channel mid-startup:** sender drop → `ready()` returns with error log,
+  startup proceeds (or downstream fails). Verify no hang and no panic.
+- Network partition between ADP and Core Agent during/after registration.
+
+## Config dependencies
+
+- `use_new_config_stream_endpoint` (run.rs:93) — gates whether `ready()` is awaited at all. If
+  false (legacy `remote_agent_enabled` only), the `(bootstrap_config, bootstrap_dp_config)` branch
+  (run.rs:149) is taken and there is **no `ready()` wait** → property N/A.
+- `standalone_mode` (run.rs:91): standalone skips remote-agent bootstrap entirely → no config stall.
+- `secure_api_listen_address` (remote_agent.rs:75) — needed for registration.
+
+## Assertion (MISSING — net-new instrumentation)
+
+No Antithesis SDK assertions exist. Proposed:
+- Wrap the conceptual "waiting for config" region with a `Sometimes("config wait was entered")`
+  reachability marker just before run.rs:120, and a `Reachable("initial configuration received")`
+  just after run.rs:121 — so the workload can distinguish "stalled" vs "progressed".
+- The busy-loop hazard (remote_agent.rs:295-298) is best caught **workload-side**: monitor CPU /
+  log-line rate while the config stream errors; assert bounded. No clean in-process assertion.
+- `Always("no panic in config path")` is implicit (panic = crash = Antithesis catches it); not a
+  bespoke assertion.
+
+## SUT-side instrumentation needs
+
+- Antithesis SDK dependency (none today).
+- Reachability markers around run.rs:119-121 to separate "entered wait" from "config received".
+- Workload-side CPU/log-rate monitor for the busy-loop hazard.
+
+### Investigation Log
+
+#### Steady stream error: busy-loop or backoff? + `init_reg_rx` boundedness (2026-05-28)
+
+**Examined:**
+- `bin/agent-data-plane/src/internal/remote_agent.rs:251-304` (`run_config_stream_event_loop`),
+  `:185-249` (`run_remote_agent_registration_loop`), `:162-183` (`RemoteAgentState::new`).
+- `lib/datadog-agent-commons/src/ipc/client/mod.rs:202-224` (`stream_config_events`) and
+  `lib/datadog-agent-commons/src/ipc/client/streaming.rs:53-93` (`StreamingResponse::poll_next`,
+  the stream type ADP iterates) plus its regression tests `:105-133`.
+- tonic 0.14.6 `Streaming::poll_next` at
+  `~/.cargo/registry/.../tonic-0.14.6/src/codec/decode.rs:399-421`.
+- `lib/datadog-agent-commons/src/ipc/session.rs:67-103` (`SessionIdHandle`).
+
+**Found — busy-loop question RESOLVED, NOT a bug:**
+- The stream ADP drains is `StreamingResponse<ConfigEvent>`, which wraps either an `Initial`
+  RPC-establishment future or a tonic `Streaming<T>`, plus a `Terminated` state (streaming.rs:11-23).
+- An **initial** RPC error (connection refused, RPC rejected, session invalid) → `Outcome::Terminate`
+  → the stream fuses to `Terminated` and yields `Some(Err(status))` exactly **once**, then `None`
+  forever (streaming.rs:70-72, 86-89). This is the dominant error mode for a steadily-failing stream
+  (the RPC never establishes), and it terminates immediately → outer loop hits the **5s sleep**
+  (remote_agent.rs:301-302). Confirmed by the test
+  `streaming_response_terminates_after_initial_error` (streaming.rs:105-122).
+- A **mid-stream** error from an already-established `Streaming<T>`: `StreamingResponse` yields the
+  `Some(Err(_))` (does NOT itself terminate, streaming.rs:74-77), but the underlying tonic
+  `Streaming::poll_next` (decode.rs:399-421) yields the error **once** then transitions to
+  `State::Error(None)` so the very next poll returns `Poll::Ready(None)` (decode.rs:403-408,
+  `status.take()` empties the Option). The explicit comment at decode.rs:403-405 confirms: "yield
+  that error once and then on subsequent calls return Poll::Ready(None)".
+- Net: in BOTH error modes the inner `while let Some(result) = stream.next().await` loop
+  (remote_agent.rs:263) sees at most one `Err`, ADP logs one `error!` (remote_agent.rs:295-298) and
+  `continue`s, then the next `.next()` yields `None` → inner loop exits → the **5s sleep**
+  (remote_agent.rs:302) runs before reconnect. There is NO unbounded spin and NO reconnect tighter
+  than 5s. The `continue` at :297 can iterate at most once per stream instance.
+- A residual log/CPU concern only remains if the Core Agent could establish the stream and then emit
+  a *steady cadence* of error frames over a long-lived HTTP/2 body without ever closing it — but
+  tonic ends the body on the first decode/transport error, so this is not reachable with the standard
+  client. The hazard described in the Failure Scenarios ("steady stream error, no EOF") is therefore
+  NOT realizable through this stack; downgrade it from "highest-value falsification target" to a
+  non-issue. Flap-the-stream (repeated EOF every 5s) remains the realistic shape.
+
+**Found — `init_reg_rx.await` (remote_agent.rs:97) is bounded:**
+- `RemoteAgentState::new` always initializes `session_id: SessionIdHandle::empty()`
+  (remote_agent.rs:176) and `initial_registration_tx: Some(init_reg_tx)` (:178). The handle is freshly
+  created per bootstrap (`RemoteAgentState::new` returns it, :85), so it cannot already hold a
+  session ID.
+- The registration loop's first `loop_timer.tick()` (remote_agent.rs:192) fires immediately (tokio
+  interval first tick is immediate). With an empty `session_id`, `state.session_id.get()` returns
+  `None` (session.rs:84-90) → the loop takes the `None` register branch (remote_agent.rs:206-246), NOT
+  the refresh branch. Both Ok and Err arms send on `initial_registration_tx.take()`
+  (remote_agent.rs:233-235 and :241-243). So the first result (success or failure) is always
+  delivered → `init_reg_rx.await` resolves on the first attempt.
+- The "session_id already Some on first tick → refresh branch, never sends" path is **impossible**:
+  the only writer of a non-None session ID is the register branch itself (`:230`), which has not yet
+  run on the first tick. Confirmed no path leaves `initial_registration_tx` unsent.
+
+**Not found:** No metric or counter for stream-reconnect cadence; the only signal is the
+`debug!`/`error!` logs at remote_agent.rs:296 and :301. No per-iteration throttle beyond the
+terminate-then-5s-sleep structure (none needed given the termination semantics above).
+
+**Conclusion:** The busy-loop concern is resolved — a steadily-erroring config stream cannot
+spin: the stream terminates after one error and the loop backs off 5s. The `init_reg_rx.await`
+bootstrap wait is bounded (always resolves Ok/Err on the first registration attempt). The remaining
+true liveness gap is unchanged and is the *snapshot stall* (`ready()` at run.rs:120 has no timeout):
+a Core Agent that registers ADP but never streams a `Snapshot` leaves ADP quiescently blocked
+forever. Property framing should drop the "steady stream error → busy-loop" falsification target as
+unreachable through tonic, and keep the "drop the snapshot → quiescent (low-CPU) indefinite block,
+no panic" assertion as the load-bearing one.
diff --git a/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md b/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md
new file mode 100644
index 00000000000..6be7188d7b6
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md
@@ -0,0 +1,134 @@
+---
+slug: data-component-failure-triggers-process-shutdown
+title: Any data component finishing triggers whole-process shutdown (fail-stop)
+family: Lifecycle Transitions & Configuration
+type: Safety (Always) + Reachability
+priority: High
+status: assertion-missing
+sut_commit: 042f41db3bd97118c38981765fd49696fce9d318
+---
+
+# data-component-failure-triggers-process-shutdown
+
+## Origin
+
+SUT analysis §2 Supervision ("**the primary data topology is NOT** [supervised] — `RunningTopology`
+spawns each data component into a `JoinSet` with **no restart**. Any data component finishing →
+`wait_for_unexpected_finish` → **whole-process shutdown**") and §7 (fail-stop recovery model: s6
+restarts ADP on exit). The invariant: ADP must never run a **silently half-running** pipeline.
+
+## Files / functions / lines
+
+- `lib/saluki-core/src/topology/built.rs:158-410` — `BuiltTopology::spawn`: each component
+  (sources, transforms, encoders, forwarders, destinations, relays, decoders) is spawned via
+  `spawn_component` (built.rs:666-687) into a single `JoinSet<Result<(), GenericError>>`. There is
+  **no per-component restart wrapper** — unlike `runtime/supervisor.rs`, the topology does not
+  re-spawn a finished component.
+- `lib/saluki-core/src/topology/running.rs:40-51` — `wait_for_unexpected_finish`:
+  ```rust
+  let task_result = self.component_tasks.join_next_with_id().await
+      .expect("no components to wait for");
+  handle_task_result(&mut self.component_task_map, task_result, /*unexpected=*/true);
+  ```
+  Returns as soon as **any one** component task finishes (Ok, Err, or panic). It does NOT loop /
+  restart — the single completion is surfaced to the caller.
+- `bin/agent-data-plane/src/cli/run.rs:280-283` — in the main `select!`:
+  ```rust
+  _ = running_topology.wait_for_unexpected_finish() => {
+      error!("Topology component unexpectedly finished. Shutting down...");
+      topology_failed = true;
+  },
+  ```
+  Any component finishing wins this select arm → falls through to
+  `running_topology.shutdown_with_timeout(30s)` (run.rs:290) → shuts down the **whole** topology →
+  process proceeds to exit. With `topology_failed = true`, the final result is `Ok` (clean shutdown)
+  but logged as "Topology shutdown complete despite error(s)." (run.rs:302-303). Process then exits;
+  the container's s6 supervisor restarts ADP (full-process restart = recovery model).
+- `lib/saluki-core/src/topology/running.rs:130-162` — `handle_task_result` with `unexpected=true`:
+  a clean `Ok(())` finish is logged as `warn!("Component unexpectedly finished.")` (running.rs:140);
+  an `Err` as `error!("Component stopped with error.")`; a `JoinError` (panic/cancel) as
+  `error!("Component task failed unexpectedly.")`.
+- `lib/saluki-core/src/runtime/supervisor.rs` — the supervisor with `OneForOne`/`OneForAll` restart
+  (supervisor.rs:477-481) applies to the **internal supervisor only** (control plane / internal
+  telemetry / env), assembled at run.rs:185-202. It does **not** wrap data components. This is the
+  "crucial split" from SUT §2.
+
+## Key observation / honest framing
+
+The fail-stop guarantee is real and structural: there is exactly one `JoinSet` for data components
+and exactly one `wait_for_unexpected_finish` arm that converts any single completion into
+whole-topology shutdown. The defensible invariant:
+
+- **Always:** whenever a data component task terminates before an operator-initiated shutdown
+  (SIGINT), the process initiates topology-wide shutdown — it never continues running the remaining
+  components as a partial pipeline. Equivalently: there is no state where component count has
+  decreased due to an unexpected finish *and* the topology keeps processing.
+- **Reachable:** the `wait_for_unexpected_finish` → shutdown path is actually hit when a data
+  component is induced to finish (proves fail-stop fires, not just that it exists).
+
+Caveat to state honestly: between the moment a component finishes and the moment shutdown completes,
+the pipeline is transiently "half-running" (other components still alive, draining). That window is
+*bounded by the 30s shutdown* (see `graceful-shutdown-within-30s`) and is by design. The invariant
+is about **not silently staying** half-running, not about instantaneous teardown.
+
+## Failure scenarios (Antithesis angle)
+
+- **Induce a component panic** (e.g. trigger one of the hot-path `.expect`/`unreachable!` sites in
+  SUT §7 #14; note the sub-second `aggregate_window_duration` panic of §7 #8 is now closed upstream)
+  → component task ends with
+  `JoinError` → `wait_for_unexpected_finish` fires → process shuts down. Assert the shutdown path is
+  reached and the process exits (s6 then restarts). Falsify if the pipeline keeps running with a
+  dead component.
+- **Component returns Ok unexpectedly** (clean finish mid-run, e.g. a source whose loop exits on a
+  closed socket) → same fail-stop path (running.rs:140 warn). Confirms even a "successful" early
+  finish triggers shutdown.
+- **Forwarder task exits on permanent error** → fail-stop.
+- Distinguish from SIGINT: under SIGINT the `ctrl_c` arm (run.rs:284) wins, `topology_failed` stays
+  false. The property is specifically about the **non-SIGINT** completion arm.
+
+## Config dependencies
+
+- Number/identity of data components depends on enabled pipelines (run.rs:414-457). The invariant
+  holds regardless, but the workload should know which component it is killing.
+- Internal-supervisor components are explicitly **excluded** — a control-plane component failing at
+  runtime is handled by the supervisor (run.rs:263-271 logs and continues), NOT by this fail-stop.
+  Do not assert fail-stop for internal-supervisor component failures.
+
+## Assertion (MISSING — net-new instrumentation)
+
+No Antithesis SDK assertions exist. Proposed SUT-side:
+- In the `wait_for_unexpected_finish` select arm (run.rs:280-283), before/at the `error!` log:
+  `assert_reachable!("data component unexpectedly finished → process shutting down")`.
+- To express the Always invariant in-process is awkward (it is enforced by control flow). Best
+  approach: a workload-side temporal assertion — *whenever* the
+  `"Topology component unexpectedly finished. Shutting down..."` log appears (or any component-death
+  telemetry), the process must subsequently exit (and not continue serving). Antithesis
+  query-logs/temporal checks (event A is always followed by process-exit B) fit this well.
+- Optionally instrument `handle_task_result` (running.rs) to emit a distinct telemetry/log on
+  unexpected component finish so the workload can detect the half-running window and assert it is
+  always followed by shutdown.
+
+## Open questions
+
+- **Is `wait_for_unexpected_finish` always being polled?** It is one arm of the run.rs:255 `select!`.
+  Once any other arm completes (SIGINT, internal supervisor finish) the `select!` exits and
+  `wait_for_unexpected_finish` is no longer polled — but at that point shutdown is already happening,
+  so the invariant still holds. Confirm there is no window after topology spawn (run.rs:239) but
+  before the `select!` (run.rs:255) where a component could finish unobserved. The intervening code
+  (run.rs:241-253) spawns a detached readiness task and sets two bools — quick and non-awaiting on
+  the topology — so the gap is negligible, but worth noting. WHY IT MATTERS: a component dying in
+  that gap would still be caught by `join_next_with_id` once the select polls (JoinSet buffers
+  completions), so likely safe; confirm.
+- **`expect("no components to wait for")` (running.rs:45):** if the topology has zero components,
+  `join_next_with_id()` returns `None` and this panics. Could an empty topology be built? run.rs:401
+  errors out if `!data_pipelines_enabled()`, and `create_topology` adds at least a forwarder +
+  components for any enabled pipeline, so a spawned topology should be non-empty. WHY IT MATTERS: a
+  panic here would itself trigger process exit (still fail-stop-ish) but via an ugly path. WHAT
+  CHANGES: low priority; document as a defensive-panic site.
+
+## SUT-side instrumentation needs
+
+- Antithesis SDK dependency (none today).
+- Reachable marker on the run.rs:280 unexpected-finish arm.
+- A distinct log/telemetry event on unexpected component finish to anchor a workload-side temporal
+  "death-implies-exit" assertion.
diff --git a/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md b/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md
new file mode 100644
index 00000000000..3344deb2b4f
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md
@@ -0,0 +1,108 @@
+---
+slug: ddsketch-bin-count-bounded
+title: DDSketch bin count never exceeds bin_limit
+type: Safety
+priority: Medium
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+status: assertion MISSING (as Antithesis); strong unit + proptest coverage exists
+---
+
+# ddsketch-bin-count-bounded — DDSketch bin count never exceeds bin_limit
+
+## Property (one sentence)
+After any sequence of inserts, multi-weight inserts, interpolations, and merges, an agent
+`DDSketch`'s bin count never exceeds `bin_limit` (4096).
+
+## Origin
+- SUT analysis §5 safety #4: "bin count must never exceed 4096".
+- Doc/comment in `agent/sketch.rs` collapse logic and `trim_left` ("leaving exactly bin_limit bins").
+
+## Files / functions / lines (CONFIRMED)
+- `lib/ddsketch/build.rs` (2–4): `AGENT_DEFAULT_BIN_LIMIT = 4096`, `AGENT_DEFAULT_EPS = 1/128`,
+  `AGENT_DEFAULT_MIN_VALUE = 1e-9`; emitted as `DDSKETCH_CONF_BIN_LIMIT`.
+- `lib/ddsketch/src/agent/config.rs` (8, 14–15): `Config.bin_limit` set from `DDSKETCH_CONF_BIN_LIMIT`.
+- `lib/ddsketch/src/agent/sketch.rs`
+  - `trim_left` (689–714): enforces the cap. `if bin_limit == 0 || bins.len() <= bin_limit { return; }`
+    else drains the lowest `bins.len() - bin_limit` bins, folding their mass into the first kept
+    bin → "leaving exactly bin_limit bins" (712–713).
+  - Every mutation path calls `trim_left(.., SKETCH_CONFIG.bin_limit)`:
+    `insert` (358), `insert_keys` (319), `insert_key_counts` (255), `merge` (579).
+  - `generate_bins` (716–735): a single `(k, n)` with `n >= u32::MAX` could emit multiple bins
+    for one key (overflow split, 731–733); but `trim_left` runs immediately after every caller,
+    so the post-operation bin count is still capped. (Historic regression fixed: with old u16
+    bins a large weight exploded bin count — test `trim_left_respects_bin_limit_with_large_weights`
+    824–846.)
+  - `bin_count()` (90–92): `self.bins.len()` — the value to assert on.
+- Existing tests already assert this: unit tests (760–891) and **proptests**
+  `prop_bin_count_never_exceeds_limit` (924–936), `prop_output_bins_are_sorted_and_distinct`,
+  `prop_output_keys_are_highest_from_input`, etc. (919–1023).
+- ADP entry into the agent sketch: `aggregate/mod.rs:7` `use ddsketch::DDSketch` (re-export of
+  `agent::DDSketch`, `lib.rs:56`); built in `transform_and_push_metric` (743–745) via
+  `DDSketch::default()` + `insert_n` per histogram sample, and distributions flow as `SketchPoints`.
+- Separate impl: the **canonical** `DDSketch` (`canonical/sketch.rs`) uses
+  `CollapsingLowestDenseStore` with `max_num_bins` (default 2048) and an `assert!(max_num_bins >= 1)`
+  (collapsing_lowest.rs:37), collapsing on growth (67–87). Not on the aggregate hot path, but the
+  same invariant applies if/where it is used.
+
+## Failure scenario (Antithesis angle)
+The unit/proptests run only under `cargo test` on isolated inputs. Antithesis adds value by
+checking the invariant **live, on real production sketches**, after arbitrary interleavings:
+- Histogram→distribution conversion inserting thousands of distinct sample values per flush
+  (743–745) feeding `insert_n` with large weights, then `merge`d across windows.
+- Merge of many incoming agent sketches (future "take sketches shipped by the agent, merge them"
+  use case, sketch.rs:33–35) where each `merge` (542–582) extends then `trim_left`s.
+- A code path that mutates bins **without** calling `trim_left` (e.g. a new insert helper, or
+  `insert_raw_bin` 491–495 which intentionally does NOT trim) escaping the cap — exactly the
+  kind of regression a live `Always` assertion would catch that the targeted tests would not.
+
+## Observations
+- This invariant is structurally enforced today; the Antithesis assertion is a **regression
+  tripwire** placed at the sketch boundary, valuable because the cap is re-established by a
+  separate `trim_left` call at each mutation site (easy to miss when adding a new mutator).
+- SUT-side instrumentation strongly preferred: `bin_count()` is internal and per-sketch; a
+  workload-side checker cannot see individual sketches mid-pipeline.
+
+## Config dependencies
+- `bin_limit` is compile-time (build.rs), not runtime-configurable for the agent sketch.
+
+## Suggested assertion
+- `assert_always(self.bins.len() <= SKETCH_CONFIG.bin_limit as usize,
+  "DDSketch bin count within bin_limit")` placed at the **end of every mutating method**
+  (`insert`, `insert_n`/`insert_keys`/`insert_key_counts`, `merge`, `insert_interpolate_buckets`)
+  — i.e. one shared check after `trim_left`.
+- `Reachable("DDSketch trim_left collapsed bins")` to confirm the workload actually drives a
+  sketch past the limit (otherwise the `Always` is vacuously true and proves nothing).
+
+## Open questions
+- `insert_raw_bin` is `#[cfg(test)]`/`pub(crate)` test-only (490) and bypasses `trim_left` — confirm
+  it can never be reached in a release build (otherwise it is a hole in the invariant).
+- `generate_bins` overflow split for `n >= u32::MAX`: under truly extreme single-key weights, does
+  the transient (pre-`trim_left`) bin vector allocation matter for memory? Probably not, but worth
+  a `Sometimes` if huge weights are in scope.
+
+## Investigation Log
+
+#### Which DDSketch (agent 4096 vs canonical 2048) is on ADP's live aggregation path
+- **Examined**: `lib/ddsketch/src/lib.rs:46-56` (module layout + crate-root re-export);
+  `lib/ddsketch/build.rs:2-4,45` (`AGENT_DEFAULT_BIN_LIMIT = 4096`, `AGENT_DEFAULT_EPS = 1/128`);
+  `lib/ddsketch/src/agent/config.rs` (`Config`, generated `DDSKETCH_CONF_BIN_LIMIT`),
+  `agent/sketch.rs:255,319,358,829` (`trim_left(..., SKETCH_CONFIG.bin_limit)`);
+  `transforms/aggregate/mod.rs:7` (`use ddsketch::DDSketch`), `:740-750` (distribution build via
+  `DDSketch::default()` + `insert_n`); `bin/agent-data-plane/src/cli/run.rs:491-498`
+  (metrics pipeline => `dd_metrics_encode`), `encoders/datadog/metrics/mod.rs:5,467,841,1006`;
+  grep of `ddsketch::canonical` / `canonical::DDSketch` usage across `lib` and `bin`.
+- **Found**: The crate root re-exports the **agent** implementation:
+  `pub use agent::{Bin, Bucket, DDSketch};` (`lib.rs:56`) — so `ddsketch::DDSketch` *is* the agent
+  sketch (bin_limit **4096**, eps 1/128). The aggregate transform imports `ddsketch::DDSketch`
+  (`mod.rs:7`) and uses it in the histogram->distribution conversion path (`mod.rs:743-750`). The
+  DD metrics encoder also uses `ddsketch::DDSketch` (`metrics/mod.rs:5`) for Histogram/Distribution
+  values. The **canonical** sketch (`ddsketch::canonical`, `max_num_bins`-based,
+  `CollapsingLowestDenseStore`, relative_accuracy 0.01) has **no non-test usage** in `lib` or
+  `bin` — it is library-only / not wired into any ADP topology component.
+- **Not found**: Any live ADP component constructing `ddsketch::canonical::DDSketch`.
+- **Conclusion**: RESOLVED. Only the **agent sketch (bin_limit 4096)** is on the live ADP path
+  (aggregate distribution build + DD metrics sketch encoding). The canonical sketch (2048) is not
+  reachable in production. The bin-count assertion should target the agent sketch's
+  `SKETCH_CONFIG.bin_limit == 4096` exclusively; canonical can be dropped from scope.
diff --git a/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md b/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md
new file mode 100644
index 00000000000..5c78274472e
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md
@@ -0,0 +1,171 @@
+---
+slug: ddsketch-no-nan-poison
+title: A NaN sample never silently poisons a sketch's sum/avg
+type: Safety / Reachability
+priority: High
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+status: assertion MISSING; CONFIRMED sketch boundary does NOT guard finiteness
+---
+
+# ddsketch-no-nan-poison — A NaN sample never silently poisons a sketch's sum/avg
+
+## Property (one sentence)
+A single NaN (or other non-finite) sample must never silently poison a sketch's `sum`/`avg`;
+for any finite input stream the sketch's `sum`/`avg` stay finite, and the sketch boundary
+rejects or sanitizes non-finite values rather than absorbing them.
+
+## Origin
+- SUT analysis §7 #10 (Wildcard): "NaN poisons a DDSketch (`agent/sketch.rs:188-206`):
+  `sum`/`avg` go NaN permanently; finiteness is guarded per-source (DSD codec), not at the
+  sketch boundary — fragile if a new producer is added."
+
+## Files / functions / lines (CONFIRMED)
+- `lib/ddsketch/src/agent/sketch.rs`
+  - `adjust_basic_stats(v, n)` (188–206): **NO finiteness check.**
+    `self.sum += v * n as f64;` (198) → if `v` is NaN, `sum` becomes NaN permanently (NaN is
+    sticky under `+`). `self.avg += (v - self.avg)/count` (201) / the `n>1` branch (205) →
+    `avg` also goes NaN. `min`/`max` comparisons (189, 193) are all false for NaN, so a NaN
+    leaves min/max unchanged but silently corrupts sum/avg.
+  - Entry points that call `adjust_basic_stats` with caller-supplied `v` and **no NaN reject**:
+    `insert(v)` (327–330), `insert_n(v,n)` (374–384, calls `adjust_basic_stats` at 380 for n>1
+    and `insert` for n==1), `insert_many` (362–371), `insert_interpolate_bucket` (426, 440),
+    `insert_raw_bin` (493).
+  - `Config::key(NaN)` (config.rs 70–87): every comparison with NaN is false, so it falls to
+    `log_gamma(NaN).round_ties_even() as i32` = 0, `key = norm_bias`, clamped to `[1, MAX_KEY]`
+    → NaN gets a **valid bin** (count incremented) while sum/avg are poisoned: the sketch looks
+    populated but its sum/avg are NaN. `count` still increments (197), so `is_empty()` is false
+    and `sum()`/`avg()` return `Some(NaN)`.
+  - `quantile` (535): `.or(Some(f64::NAN))` can itself yield NaN as a fallback — distinct from
+    poisoning, but means a NaN out of `quantile` is not by itself proof of poisoning.
+- ADP call site (the boundary in question): `transform_and_push_metric` (744–745):
+  `sketch.insert_n(sample.value.into_inner(), sample.weight.0 as u64)` — calls `insert_n`
+  **directly**, with no finiteness guard at this boundary.
+- Per-source guard that exists today (the *only* current protection):
+  - DSD codec drops non-finite float values (SUT analysis §8 "drop non-finite floats in codec";
+    §7 #7 `non_finite_metric_values_are_silently_dropped`). So in the current DSD pipeline NaN
+    is filtered upstream — but this is **not** enforced at the sketch boundary, so any new
+    producer (OTLP path, replay, future sources) that reaches `insert_n` bypasses the guard.
+- `stele` diff comparison (`metrics.rs` 175–182) compares sketch `sum`/`avg` with
+  `approx_eq_ratio` — a NaN sum makes any comparison false, so poisoning would surface as a
+  diff-test mismatch *if* the harness happened to feed a NaN; the deterministic happy-path
+  workload does not.
+
+## Failure scenario (Antithesis angle)
+1. A producer reaches the sketch boundary with a NaN/±Inf sample value (a new source, a replay
+   record with a corrupt value, or a regression that removes the codec guard). `insert_n`
+   absorbs it; `sum`/`avg` go NaN for the lifetime of that sketch and propagate through
+   `merge` (551–552) into every sketch it touches and downstream into the emitted distribution
+   → permanently wrong customer data, silently.
+2. A non-finite `weight` is impossible (it is a `u64`), but `value.into_inner()` is an `f64` with no
+   guarantee of finiteness at this call site.
+
+## Observations
+- The cheapest robust assertion is **at the sketch boundary** (inside `adjust_basic_stats` or at
+  the top of `insert`/`insert_n`/`insert_many`), because that is the single chokepoint and is
+  exactly where the missing guard lives. SUT-side instrumentation strongly wins; workload-side
+  can only observe a NaN sum after it has already propagated downstream.
+- Two framings:
+  - **Outcome invariant:** `assert_always(self.sum.is_finite() && self.avg.is_finite(),
+    "sketch sum/avg finite")` after each mutation — for a workload that only injects finite
+    values, this catches any internal NaN production; with NaN injection it documents the
+    poisoning.
+  - **Boundary invariant:** if a finiteness guard is added, `assert_unreachable("non-finite
+    value reached DDSketch::adjust_basic_stats")` to prove NaN never gets absorbed.
+
+## Config dependencies
+- None directly. Reachability depends on which producers feed the aggregate sketch path
+  (DSD codec currently filters; OTLP/replay/future sources may not).
+
+## Suggested assertion
+- Primary: `assert_always(v.is_finite(), "value reaching DDSketch is finite")` at the top of
+  `adjust_basic_stats` (covers all insert/merge-feeding paths) — OR, if the boundary is changed
+  to reject, `assert_unreachable` on the absorbed-NaN path.
+- Secondary outcome check: `assert_always(self.sum.is_finite(), "DDSketch.sum finite")` after
+  mutations, as a backstop for internally produced non-finite (e.g. overflow → Inf).
+- `Reachable("non-finite sample offered at sketch boundary")` only if the workload deliberately
+  injects NaN past the codec — otherwise keep it `Unreachable`-style to assert the guard holds.
+
+## Open questions
+- Should the fix **reject/skip** the NaN at the sketch boundary (matching the codec's drop
+  policy) or **clamp**? Rejecting keeps count/sum consistent; the Agent's policy here should be
+  confirmed against the diff baseline (ties into `aggregate-matches-agent`).
+- `quantile`'s `.or(Some(f64::NAN))` fallback (535): is a NaN quantile result distinguishable
+  from a poisoned-sketch NaN downstream? The assertion should target sum/avg, not quantile, to
+  avoid conflating the two.
+- Are `+Inf`/`-Inf` (e.g. from an overflowing `sum`) in scope? `v.is_finite()` covers both NaN
+  and Inf; confirm the desired policy treats them identically.
+
+## Investigation Log
+
+#### Does any non-DSD producer reach the DDSketch insert boundary without the codec FloatIter finiteness filter?
+- **Examined:** the finiteness filter (`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs:254,
+  273-303` — `FloatIter` skips non-finite with `value.is_finite()` at :299 and a debug log at :301);
+  every agent-DDSketch insert caller in the tree (`grep` for `insert`/`insert_n`/`insert_many`/
+  `insert_interpolate_buckets`/`add_n`); and the ADP topology wiring in
+  `bin/agent-data-plane/src/cli/run.rs:462-499, 593-686, 745-755`.
+  Specifically traced: (a) OTLP — `lib/saluki-components/src/sources/otlp/metrics/translator.rs`;
+  (b) self-telemetry — `lib/saluki-core/src/observability/metrics/mod.rs:299-310`,
+  `processor.rs`; (c) the aggregate histogram→distribution path —
+  `lib/saluki-components/src/transforms/aggregate/mod.rs:737-762`; (d) checks_ipc —
+  `lib/saluki-components/src/sources/checks_ipc/mod.rs:185-204`; (e) the datadog metrics encoder —
+  `lib/saluki-components/src/encoders/datadog/metrics/mod.rs:1043-1061`.
+- **Found:**
+  - **Confirmed: the sketch boundary itself has no finiteness guard.** `agent/sketch.rs` `insert`
+    (:327), `insert_n` (:374), `insert_many` (:362), `insert_interpolate_bucket` (:387) all call
+    `adjust_basic_stats` (:188) which does `self.sum += v * n` unconditionally — NaN poisons
+    sum/avg permanently. `FloatIter` (codec) is the *only* finiteness filter in the metric path.
+  - **Aggregate transform `insert_n` path (the flagged mod.rs:745) is DSD-ONLY → CLOSED.** The
+    aggregate transform (`dsd_agg`) is wired **exclusively** into the DSD pipeline:
+    `dsd_in → dsd_enrich → dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter →
+    metrics_enrich` (run.rs:664-679). DSD metrics pass through `FloatIter` at decode time, so the
+    `Histogram` samples reaching `aggregate/mod.rs:745` (`sketch.insert_n(sample.value...)`) are
+    already finite. **checks_ipc and OTLP metrics join the topology at `metrics_enrich`
+    (run.rs:469/499 and run.rs:753), which is DOWNSTREAM of `dsd_agg`** — they never enter the
+    aggregate transform. So no non-DSD producer reaches `insert_n` *in the aggregate transform*.
+  - **OTLP number path → CLOSED.** `get_number_data_point_value` (translator.rs:1366) feeds
+    `is_skippable` (`value.is_nan() || value.is_infinite()`, :1374-1377) in both
+    `map_number_metrics` (:726) and `map_number_monotonic_metrics` (:754); non-finite values are
+    skipped with a warn/debug log. Gauges/counters never carry NaN downstream.
+  - **OTLP histogram/sketch path → effectively CLOSED (no NaN poisoning).** OTLP histograms become
+    sketches via two routes, neither of which feeds a raw NaN into `adjust_basic_stats`:
+    (i) exponential histograms build a `Dogsketch` proto and use `DDSketch::try_from`
+    (`build_agent_sketch_from_key_counts`, :314-351) — never calls `insert`/`adjust_basic_stats`;
+    (ii) explicit-bounds histograms call `qa.insert_interpolate_buckets(buckets)` (:889) where bucket
+    `upper_limit` comes from the payload's `explicit_bounds`, which is **not** finiteness-checked.
+    However, `insert_interpolate_bucket` (sketch.rs:387) only ever passes
+    `SKETCH_CONFIG.bin_lower_bound(key)` (a finite reconstructed value) into `adjust_basic_stats`
+    (:426, :440), never the raw bound. A NaN bound makes `distance`/`fkn` NaN → `fkn as u64 == 0` →
+    no per-key insert; the remainder branch uses a finite `bin_lower_bound`. So a NaN explicit bound
+    can distort *bucketing* but does not poison sum/avg with NaN. (`insert_interpolate_buckets` at
+    :465-481 handles ±Inf explicitly but not NaN — a latent robustness gap, not a poisoning path.)
+  - **LIVE non-DSD NaN→sketch path FOUND: checks_ipc Histogram → datadog metrics encoder.**
+    - `checks_ipc/mod.rs:195`: `MetricType::Histogram => Metric::histogram(context, (timestamp,
+      value))` where `value` is the raw f64 from an external Python check over IPC. **No `is_finite`
+      / `is_nan` check anywhere in this decode** (mod.rs:185-204). A check emitting NaN produces a
+      `Histogram` metric carrying NaN.
+    - That metric flows `checks_ipc_in.metrics → metrics_enrich → dd_metrics_encode` (run.rs:469,
+      499) — i.e. it does NOT pass through the DSD codec FloatIter and does NOT enter the aggregate
+      transform.
+    - The encoder converts `MetricValues::Histogram` to a sketch by calling **`ddsketch.insert_n(
+      sample.value.into_inner(), sample.weight...)`** at `encoders/datadog/metrics/mod.rs:1054`
+      (inside `encode_sketch_metric`, the `Histogram` arm at :1049-1058). This is a direct
+      `insert_n` of the raw sample value with no finiteness guard → `adjust_basic_stats` →
+      `sum += NaN`. The emitted Datadog sketch payload then carries a NaN sum/avg silently.
+  - `distribution_sampled_fallible` (value/mod.rs:312, `insert_n`) is called ONLY from the DSD codec
+    (metric.rs:267, fed by `FloatIter`) → DSD-only, safe.
+- **Not found:** No finiteness filter on the checks_ipc value path; none at the sketch boundary;
+  none in the encoder's Histogram→sketch conversion. No code path where the *aggregate transform*
+  receives non-DSD input.
+- **Conclusion:** RESOLVED, and the hazard is **LIVE** (not closed on all paths). The specifically
+  flagged aggregate `insert_n` path (aggregate/mod.rs:745) is closed because that transform is
+  DSD-only. OTLP number and histogram paths do not poison sum/avg. **But there is a live non-DSD
+  NaN-poisoning path: a Python check emitting a Histogram metric with a NaN value via checks_ipc
+  (checks_ipc/mod.rs:195, no finiteness check) reaches `DDSketch::insert_n` in the Datadog metrics
+  encoder (encoders/datadog/metrics/mod.rs:1054), bypassing both the DSD FloatIter and the aggregate
+  transform.** Therefore ddsketch-no-nan-poison and the ghost-metric hazard remain LIVE. The robust
+  fix is a guard at the sketch boundary (`adjust_basic_stats`/`insert*`), since the per-producer
+  filter (FloatIter) demonstrably does not cover the checks_ipc→encoder path. The Antithesis angle:
+  drive a check (or checks_ipc IPC input) that emits a NaN histogram value and assert sketch sum/avg
+  stay finite at the encoder boundary.
diff --git a/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md b/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md
new file mode 100644
index 00000000000..aca5871ea1d
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md
@@ -0,0 +1,128 @@
+---
+slug: ddsketch-relative-error-bound
+title: DDSketch quantiles within relative-error bound; merges associative/commutative
+type: Safety
+priority: Low
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+status: assertion MISSING
+---
+
+# ddsketch-relative-error-bound — Quantile accuracy + merge associativity/commutativity
+
+## Property (one sentence)
+For values within the non-collapsed representable range, quantile queries are within the
+configured relative error (eps ≈ 0.78%, gamma = 1 + 2·eps), and merging sketches is
+associative and commutative (order of merges does not change the result).
+
+## Origin
+- SUT analysis §5 safety #4: "DDSketch relative-error guarantee: eps=1/128 (~0.78%) … merge
+  associative/commutative."
+
+## Files / functions / lines (CONFIRMED)
+- `lib/ddsketch/build.rs` (3, 14–38): `eps = 1/128`; `eps *= 2; gamma_v = 1 + eps;
+  gamma_ln = ln_1p(eps)`; `norm_min`, `norm_bias` derived; `assert!(norm_min <= min_value)`.
+- `lib/ddsketch/src/agent/config.rs`
+  - `key(v)` (70–87): `log_gamma(v).round_ties_even() + norm_bias`, **clamped to `[1, MAX_KEY]`**
+    (86). Values with `|v| < norm_min` map to key 0 (75–77); values above `gamma^MAX_KEY`
+    saturate at `MAX_KEY` (i.e. INF bucket). **Accuracy is NOT guaranteed at these extremes** —
+    this is the caveat the property must scope around.
+  - `bin_lower_bound(k)` (47–62): inverse mapping; `gamma_v.powf(k - norm_bias)`.
+- `lib/ddsketch/src/agent/sketch.rs`
+  - `quantile(q)` (498–536): rank via `rank(count,q) = round_ties_even(q*(count-1))` (668–671);
+    interpolates `v_low*weight + v_high*(1-weight)` with `v_high = v_low * gamma_v` (522–523);
+    result `clamp(self.min, self.max)` (535); empty → `None`; `q<=0 → min`, `q>=1 → max`.
+  - `merge(other)` (542–582): merges basic stats (count/min/max/sum/avg) then bins, then
+    `trim_left`. Bin merge is order-independent on keys; `avg` uses an incremental formula
+    (552) that is **not** exactly order-independent in floating point.
+- Canonical impl (`canonical/mapping/logarithmic.rs`): `relative_accuracy() = (gamma-1)/(gamma+1)`
+  (114–115); `index = ceil(log(value)/log(gamma))` (9, 97). `with_relative_accuracy` rejects
+  accuracy outside `(0,1)` (40–47). Separate from the agent sketch but same family.
+
+## Failure scenario (Antithesis angle)
+The diff-test (`stele` 175–182) compares sketches on min/max/avg/sum (ratio 1e-8) + exact
+count + exact bin_count, on a deterministic clean run. Antithesis adds:
+1. **Merge order under interleaving:** windows flushed/merged in different orders (delayed
+   flush, backpressure reordering) could expose non-associativity. The bin merge is exact on
+   counts, but `avg` (incremental, 552) and `sum` accumulate floating-point error that depends
+   on merge order → a `quantile`/avg drift the diff test (single fixed order) never sees.
+2. **Quantile error at the boundary of the representable range:** values near `norm_min` (1e-9)
+   or above the max key collapse to key 0 / INF; quantiles there can exceed the 0.78% bound
+   (the documented caveat, sketch.rs:48-51 and canonical "extremes"). The property must assert
+   the bound only for in-range values and `Sometimes`-observe out-of-range handling.
+3. **Collapsed sketch:** once `trim_left`/`CollapsingLowest` collapses low bins, the relative
+   error guarantee for low quantiles is intentionally void (`is_collapsed`,
+   collapsing_lowest.rs:50-51). Assertion must exclude collapsed-low-quantile queries.
+
+## Observations
+- Two checkable sub-properties:
+  - (a) **Accuracy:** for an inserted value `v` within range, `quantile(q)` for the rank of `v`
+    is within `gamma`-relative error of `v`: `|q_est - v| <= eps_rel * |v|` where
+    `eps_rel = (gamma_v - 1)/(gamma_v + 1) ≈ 1/128`.
+  - (b) **Merge invariance:** `merge(A, merge(B,C)) ≈ merge(merge(A,B), C)` and
+    `merge(A,B) ≈ merge(B,A)`: bins must match exactly; sum/avg must match within a small
+    floating-point ratio tolerance.
+- Best validated SUT-side with a known input set so the expected rank value is computable;
+  workload-side cannot reconstruct per-sample ground truth from emitted aggregates.
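+
+A worked check of the `eps_rel` figure in (a). It assumes only the `build.rs` constants quoted in
+this note (`eps = 1/128`, doubled before `gamma_v = 1 + eps` is formed) and that
+`(gamma_v - 1)/(gamma_v + 1)` is the right accuracy formula for the agent sketch:
+
+```latex
+\begin{aligned}
+\varepsilon &= \tfrac{1}{128}, \qquad \gamma = 1 + 2\varepsilon = 1.015625 \\
+\varepsilon_{\mathrm{rel}} &= \frac{\gamma - 1}{\gamma + 1}
+  = \frac{2\varepsilon}{2 + 2\varepsilon}
+  = \frac{\varepsilon}{1 + \varepsilon}
+  = \frac{1}{129} \approx 0.775\%
+\end{aligned}
+```
+
+Under those assumptions the bound is marginally tighter than 1/128 (≈ 0.781%), so a harness that
+asserts against 1/128 would be slightly loose rather than too strict.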
+
+## Config dependencies
+- eps/gamma/bin_limit are compile-time (build.rs). `aggregate_window_duration` controls how
+  many samples land in one sketch before flush/merge (affects collapse likelihood).
+
+## Suggested assertion
+- `assert_always(quantile_within_relative_error, "DDSketch quantile within eps for in-range value")`
+  evaluated in a SUT-side test harness over in-range inputs (exclude key-0 / INF / collapsed-low).
+- `assert_always(merge_result_equal_within_ratio, "DDSketch merge is order-independent")` over a
+  set of sketches merged in two different orders.
+- `Sometimes(value_out_of_representable_range)` to confirm the extreme-value carve-out is exercised
+  (and to document that accuracy is not claimed there).
+
+## Open questions
+- What floating tolerance for `avg`/`sum` under reordered merges is acceptable vs the 1e-8 the
+  diff test uses? The incremental `avg` (552) and `sum +=` (551) are order-sensitive in f64;
+  need a principled epsilon to avoid false positives.
+(All other prior open questions are resolved; see the Investigation Log below.)
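+
+A minimal illustration of why `sum +=` is order-sensitive in f64. It assumes IEEE 754 binary64 with
+round-to-nearest, ties-to-even, written as ⊕; it shows only that regrouping can change the result,
+not how much error accumulates over a real merge sequence:
+
+```latex
+\begin{aligned}
+u &= 2^{-53} \\
+(1 \oplus u) \oplus u &= 1 \oplus u = 1 \\
+1 \oplus (u \oplus u) &= 1 \oplus 2^{-52} = 1 + 2^{-52}
+\end{aligned}
+```
+
+A single regrouping therefore moves the result by about 2.2e-16 in relative terms, far below the
+diff test's 1e-8; the open part is how that grows with the number and magnitude of merged terms.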
+
+## Investigation Log
+
+#### Which sketch on the live path, and does ADP call `DDSketch::quantile` at runtime?
+- **Examined**: `lib/ddsketch/src/lib.rs:56` (crate-root = agent sketch);
+  `lib/ddsketch/src/agent/sketch.rs:498` (`pub fn quantile`); `lib/ddsketch/build.rs:2-4`
+  (eps = 1/128, build.rs doubles it: `eps *= 2.0` then `gamma_v = 1+eps`, lines 19-20);
+  `lib/ddsketch/src/canonical/mapping/fixed.rs:38` (`RELATIVE_ACCURACY = 0.01`);
+  `transforms/aggregate/mod.rs:735-799` (histogram statistics emit) and `config.rs:58-69`
+  (`value_from_histogram`); `lib/saluki-core/src/data_model/event/metric/value/histogram.rs:166-197`
+  (`HistogramSummary::quantile`); grep of `.quantile(` across `lib`+`bin` excluding ddsketch internals;
+  `bin/agent-data-plane/src/cli/run.rs:491-498` (live metrics pipeline -> `dd_metrics_encode`);
+  `encoders/datadog/metrics/mod.rs:1006-1150` (`encode_sketch_metric` / sketch serialization);
+  `destinations/prometheus/mod.rs:343-348` (the one `sketch.quantile(q)` runtime call site).
+- **Found (a) — sketch type**: Live aggregation uses the **agent** sketch (`ddsketch::DDSketch`,
+  eps 1/128). The canonical sketch (`fixed.rs` RELATIVE_ACCURACY 0.01, 2048 bins) is not wired into
+  any ADP component (confirmed in companion `ddsketch-bin-count-bounded` log). So the accuracy
+  target is the agent sketch's fixed relative accuracy, not the canonical 0.01.
+- **Found (b) — quantile NOT queried on the live path**: ADP does **not** call
+  `DDSketch::quantile` at runtime on the production metrics path. Two distinct percentile paths
+  exist, neither using `DDSketch::quantile` live:
+  (1) Histogram-mode percentiles in aggregate go through `HistogramStatistic::value_from_histogram`
+  (`config.rs:67`) -> `summary.quantile(q)`, which is `HistogramSummary::quantile`
+  (`histogram.rs:172-197`) operating on **raw sorted samples** — a separate structure, NOT a
+  DDSketch. (2) Distribution-mode builds a `DDSketch` (`mod.rs:743-750`) and the DD metrics encoder
+  serializes it via `encode_sketch_metric`, writing `sketch.bins()` keys/counts plus
+  count/min/max/avg/sum (`metrics/mod.rs:1135-1150`) — it ships the raw bins over the wire and never
+  queries a quantile. Quantile estimation happens **server-side at Datadog** after ingestion.
+  The only runtime `DDSketch::quantile` caller in the codebase is the **prometheus** destination
+  (`prometheus/mod.rs:346`), which is internal-telemetry/scrape only and not in the primary
+  metrics topology.
+- **Not found**: Any live ADP metrics-path call to `DDSketch::quantile`; any production use of the
+  canonical sketch.
+- **Conclusion**: RESOLVED (with a framing consequence). The accuracy assertion's "live" target is
+  the **agent sketch (eps 1/128)**, and ADP **does not query DDSketch quantiles at runtime on the
+  customer metrics path** — it ships raw bins to Datadog. Therefore an `Always(quantile within eps)`
+  assertion cannot be anchored to a production runtime call; it must be an **SUT-side test-harness**
+  assertion over the agent sketch in isolation (the existing unit/proptest level), OR retargeted to
+  the property that actually matters in production: *bins/summary are serialized faithfully and bin
+  count stays capped* (covered by `ddsketch-bin-count-bounded`). The histogram-percentile path that
+  IS computed in-process uses raw-sample `HistogramSummary::quantile`, which is exact (no DDSketch
+  relative-error bound applies). Property framing should be narrowed: the DDSketch relative-error
+  guarantee is a library invariant, not a live ADP runtime invariant.
diff --git a/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md b/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md
new file mode 100644
index 00000000000..0798d2a02ef
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md
@@ -0,0 +1,175 @@
+---
+slug: disk-persisted-retry-survives-restart
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Liveness (with safety sub-clauses: no-duplication, poison-drop)
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Disk-persisted retry transactions survive process kill+restart and are eventually delivered exactly once
+
+## Origin
+SUT analysis §2 (disk persistence), §3 (two disk-backed subsystems), §6 gap #3
+("disk-persisted retry queue recovery never tested across a real kill+restart"), §9
+open question (persisted.rs disk-full/partial-write/corrupt across crash). No Antithesis
+assertion exists.
+
+## What the code does
+
+### Persistence enable + silent fallback
+`lib/saluki-components/src/common/datadog/io.rs:391-409`: a `RetryQueue` is created; if
+`config.retry().storage_max_size_bytes() > 0`, `with_disk_persistence(...)` is awaited. On
+**init failure it logs and silently falls back to an in-memory-only `RetryQueue`** (~405-408):
+"Failed to initialize disk persistence ... Transactions will not be persisted." This is a
+durability *downgrade* with no hard failure — the operator believes persistence is on but it isn't.
+
+### Flush to disk on shutdown
+`io.rs:488-503`: on endpoint-task shutdown, `pending_txns.flush().await` is called; `flush`
+(`io.rs:781-800`) pushes high-priority into low-priority then `self.low_priority.flush()`, which
+persists outstanding transactions to disk **only if disk persistence is enabled** (`io.rs:769-776`
+doc: "If disk persistence isn't enabled, all pending transactions will be dropped"). A flush error
+logs "Events may be permanently lost" (~500-501).
+
+### Exactly-once on consume (delete-before-return)
+`lib/saluki-io/src/net/util/retry/queue/persisted.rs` `try_deserialize_entry` (~373-397): after
+deserializing, it **deletes the file from disk before returning** (~391-394) "so that we don't risk
+sending duplicates." So a successful pop removes the on-disk copy first — a crash *after* delete but
+*before* send loses that one txn (at-most-once for the in-flight item), while a crash *before* delete
+keeps it (at-least-once). The delete-before-return biases toward no-duplication.
+
+### Poison/corrupt entry handling (drop, don't loop forever)
+- `pop` (~206-243): on `try_deserialize_entry` `Err(e)` (corrupt/unreadable), it logs
+  "Permanently dropping persisted entry that could not be consumed", decrements `total_on_disk_bytes`,
+  increments `entries_dropped`, and `continue`s — does NOT retry the poison entry forever, does NOT
+  abort recovery (~227-241).
+- `remove_until_available_space` eviction path (~304-323): same poison handling during eviction.
+- `try_deserialize_entry` deserialize failure (~373-389): attempts to `remove_file` the corrupt
+  entry so it doesn't accumulate, tolerates removal failure.
+- `refresh_entry_state` (~245-273): unrecognized files are warned and skipped, not fatal.
+
+## Failure scenario (Antithesis)
+1. Enable disk persistence (`forwarder_retry_queue_storage_max_size > 0`).
+2. Drive a known set of transactions; induce an intake outage so they land in the retry queue.
+3. SIGKILL the ADP process mid-flow (the s6 container supervisor restarts it).
+4. Restore healthy intake.
+5. Expectation: every persisted retryable transaction is eventually delivered, **exactly once**
+   (no loss, no duplication). Separately: inject a corrupted on-disk entry and assert recovery
+   continues and the corrupt entry is dropped (not retried forever, not crashing recovery).
+
+## Key observations
+- "Exactly once" is approximate at the crash boundary: delete-before-return means at most one
+  in-flight txn can be lost if the crash lands in the delete→send gap, while a crash that precedes
+  the delete leaves the entry on disk to be retried after restart (at-least-once). The clean claim
+  is **no systemic loss and no duplication of the persisted backlog**; the single in-flight item is
+  a known narrow window.
+- SIGKILL (not graceful) skips the shutdown flush (`io.rs:488-503`), so only transactions *already
+  written to disk* survive; high-priority in-memory txns not yet persisted are lost. The graceful
+  path (SIGTERM/30s) flushes them to disk. The property must distinguish kill vs graceful.
+- Retry-queue IDs are stable across API-key rotation (`io.rs:514-533`) so a restart with a rotated
+  key still finds and retries the same persisted backlog — relevant if the workload rotates keys.
+
+## Config deps
+- `forwarder_retry_queue_storage_max_size` (`storage_max_size_bytes`) > 0 to enable persistence.
+- `storage_path`, `storage_max_disk_ratio` — disk-full eviction behavior
+  (`remove_until_available_space`, `on_disk_bytes_limit`).
+
+## Suggested assertion (MISSING — net-new)
+- **Sometimes(persisted-backlog-fully-recovered)**: at least once, after a kill+restart with
+  persistence enabled and intake restored, the set of transactions delivered post-restart covers the
+  persisted backlog with no duplicates (reconcile workload input vs mock-intake received, dedup by
+  transaction identity). Liveness + no-dup.
+- **AlwaysOrUnreachable(poison-dropped)**: whenever a corrupt on-disk entry is encountered, it is
+  dropped (entries_dropped increments) and recovery proceeds — never an infinite retry of the same
+  entry and never a recovery abort. Anchor at `persisted.rs:227-241` / `304-323`.
+- **Reachable(disk-persistence-actually-active)**: confirm persistence init succeeded (the
+  silent-fallback at `io.rs:405-408` did NOT fire) — otherwise the whole property is vacuously testing
+  in-memory mode. Treat the fallback as an Unreachable in the persistence-enabled workload, OR detect
+  it and fail the run setup.
+
+## SUT-side instrumentation needs
+- SDK `assert_unreachable` (or workload detection) at the silent-fallback branch `io.rs:405` when
+  persistence is configured — to catch the durability downgrade that would otherwise make the test
+  vacuous.
+- SDK `assert_reachable` at the poison-drop `continue` (`persisted.rs:238`) gated on the
+  corrupt-entry test variant.
+- Primary check is workload-side reconciliation against the mock intake with transaction-identity
+  dedup; needs a deterministic countable input set and a mock intake that records received IDs.
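+
+The workload-side reconciliation could look roughly like the following. It assumes transactions
+carry a `u64` identity and that the mock intake can report every ID it received; `reconcile` and
+`Reconciliation` are hypothetical names for this sketch.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+/// Result of comparing workload input against what the mock intake saw.
+struct Reconciliation {
+    missing: BTreeSet<u64>,
+    duplicated: BTreeSet<u64>,
+}
+
+/// `expected` is the persisted backlog; `received` is every ID the mock
+/// intake recorded, including repeats.
+fn reconcile(expected: &[u64], received: &[u64]) -> Reconciliation {
+    let mut counts: BTreeMap<u64, usize> = BTreeMap::new();
+    for id in received {
+        *counts.entry(*id).or_insert(0) += 1;
+    }
+    let missing = expected
+        .iter()
+        .copied()
+        .filter(|id| !counts.contains_key(id))
+        .collect();
+    let duplicated = counts
+        .iter()
+        .filter(|(_, n)| **n > 1)
+        .map(|(id, _)| *id)
+        .collect();
+    Reconciliation { missing, duplicated }
+}
+```
+
+A strict check would require both sets to be empty; a check that tolerates the narrow in-flight
+window would allow `missing` to hold at most one ID per kill.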
+
+## Open questions
+- **Ordering after restart**: `refresh_entry_state` sorts by timestamp (~268) but filename timestamp
+  has second granularity + nonce; confirm restart preserves enough ordering that the
+  bias-to-freshest/oldest-drop semantics aren't inverted across a restart (affects which txns survive
+  overflow, not raw loss).
+- **The narrow at-most-once window** (delete-before-return then crash before send): is the single
+  in-flight txn loss acceptable per the headline guarantee (so the assertion tolerates it), or must
+  the assertion flag it? Sets whether the reconcile allows a 1-txn slack.
+
+### Investigation Log
+
+#### Durability-downgrade visibility + torn-write classification + recovery wedging (2026-05-28)
+
+**Examined:**
+- `lib/saluki-components/src/common/datadog/io.rs:391-410` (RetryQueue create + `with_disk_persistence`
+  + silent fallback) and grep of io.rs for `persistence`/`fallback`/metric near the branch.
+- `lib/saluki-io/src/net/util/retry/queue/persisted.rs`: `try_from_path` (:30-37),
+  `decode_timestamped_filename` (:410-427), `generate_timestamped_filename` (:400-408), `push`
+  (:164-199), `pop` (:206-243), `refresh_entry_state` (:245-273), `try_deserialize_entry` (:354-398),
+  `remove_until_available_space` poison handling (:304-330), and tests
+  `pop_skips_corrupt_entry`/`pop_returns_none_when_all_entries_corrupt` (:701-795).
+
+**Found — (a) durability downgrade is surfaced ONLY as an `error!` log, no metric:**
+- On disk-persistence init failure, io.rs:405-408 runs `.unwrap_or_else(|e| { error!(endpoint_url,
+  error = %e, "Failed to initialize disk persistence for retry queue. Transactions will not be
+  persisted."); RetryQueue::new(queue_id, config.retry().queue_max_size_bytes()) })`. The only
+  observable signal is that one `error!` log line; there is **no metric/gauge/counter** emitted to
+  distinguish "persistence active" from "fell back to in-memory". Grep of io.rs for
+  persistence/fallback finds only the doc comments (:393, :773-775) and this log (:406). So a
+  workload cannot detect the downgrade via telemetry — it must scrape the log or, better, treat the
+  fallback branch as an `assert_unreachable` when persistence is configured (as the file already
+  recommends). Confirmed the downgrade is effectively silent at the metrics layer.
+- **(a, cont.) the in-memory byte cap STILL holds after fallback:** the fallback constructs
+  `RetryQueue::new(queue_id, config.retry().queue_max_size_bytes())` (io.rs:407) — identical
+  in-memory cap to the non-persisted path (io.rs:391). So in degraded mode the queue is a plain
+  capped in-memory queue with drop-oldest; the byte-cap invariant is preserved (it just becomes the
+  drop-not-spill branch). No unbounded growth from the fallback.
+
+**Found — (b) torn/partial write is classified as CORRUPT (drop+warn), and does NOT wedge recovery:**
+- `push` writes via `tokio::fs::write(&entry_path, &serialized)` (persisted.rs:184) directly to the
+  final `retry-<ts>-<nonce>.json` path. NOTE: there is NO temp-file + atomic rename despite the
+  stale comment at :165 ("Serialize the entry to a temporary file"). So a SIGKILL mid-write leaves a
+  file with a **valid filename** but **truncated/partial JSON content**.
+- On restart, `refresh_entry_state` (:245-273) scans the dir and calls `PersistedEntry::try_from_path`
+  (:253), which validates ONLY the filename via `decode_timestamped_filename` (:31, :410-427) — it
+  does NOT read or validate content. A torn write has a well-formed name, so it is accepted into
+  `entries` (not skipped as "unrecognized"). Unrecognized files (bad name) are warned and skipped
+  (:255-262) and the scan `continue`s past them.
+- When `pop` reaches the torn entry, `try_deserialize_entry` reads the bytes and `serde_json::from_slice`
+  fails (:373-375) → best-effort `remove_file` of the corrupt file (:378-384) → returns `Err`
+  (:386-387). `pop` matches that `Err` (:227-240): logs `warn!` "Permanently dropping persisted entry
+  that could not be consumed", decrements `total_on_disk_bytes`, `entries_dropped += 1`, and
+  `continue`s the loop to the next entry (:230-239). So a torn write = **corrupt → dropped (with
+  warn), counted in `entries_dropped`** — NOT treated as unrecognized-skip.
+- A single bad file does **NOT** wedge recovery: `pop`'s `loop` advances to the next entry on every
+  corrupt/poison hit (no infinite retry of the same entry — explicit comment at :228-229), and the
+  eviction path `remove_until_available_space` has the same poison handling (:313-322). The "all
+  entries corrupt" case returns `Ok(None)` cleanly (test at :774-795). Recovery proceeds past any
+  number of bad files.
+- Edge case: the `Ok(None)` branch in `pop` (:221-225, file vanished mid-recovery / `NotFound` at
+  :358-364) triggers a `refresh_entry_state` + retry — also non-wedging, since the missing file is
+  simply dropped from the rescanned set.
+
+**Not found:** No metric for the persistence-fallback downgrade. No atomic write/rename or fsync in
+`push` (torn writes are possible and are handled at read time, not prevented at write time). No code
+path where a corrupt/torn file aborts recovery or is retried indefinitely.
+
+**Conclusion (RESOLVED):** (a) The durability downgrade on disk-init failure is surfaced ONLY via a
+single `error!` log — no metric — so the workload must treat the fallback branch (io.rs:405) as an
+`assert_unreachable` (or log-scrape) to avoid a vacuous in-memory-mode test; the in-memory byte cap
+still holds after fallback. (b) A torn/partial write across a kill is classified as a **corrupt
+entry**: it is dropped with a `warn!` and `entries_dropped` increments, losing just that one
+transaction; it does **not** wedge recovery — both the `pop` scan and the eviction scan continue past
+any number of corrupt/unrecognized files. This validates the proposed
+`AlwaysOrUnreachable(poison-dropped)` assertion and confirms the "torn write loses one txn, not the
+backlog" framing. (Caveat for
+the workload: because writes are non-atomic, the corrupt-entry test variant can be produced naturally
+by SIGKILL mid-write, not only by injecting a hand-crafted corrupt file.)
diff --git a/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md b/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md
new file mode 100644
index 00000000000..473023643a5
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md
@@ -0,0 +1,113 @@
+---
+slug: events-sc-no-silent-loss
+title: Events and service-checks are delivered without silent loss under backpressure/outage
+type: Liveness (with a Safety no-silent-drop clause)
+priority: High
+status: net-new (no SDK assertion exists)
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+---
+
+# events-sc-no-silent-loss
+
+## Origin
+Coverage gap: the catalog's data-loss family (`no-silent-interconnect-drop`,
+`forwarder-eventual-delivery`, `shutdown-drains-no-loss`) is reasoned and instrumented entirely on
+the **metrics** path (encoder = `dd_metrics_encode`, the aggregation pipeline). The events and
+service-check sub-pipelines are always-on production paths with a **different shape**: no
+aggregation buffer, a straight
+`dsd_in.{events,service_checks} → *_enrich → dd_{events,service_checks}_encode → dd_out` chain
+(`run.rs:681-684`), and their own encoders with their own silent-drop branch. None
+of the existing properties assert events/SC reach the forwarder; this fills that gap. (It EXTENDS
+`no-silent-interconnect-drop` / `forwarder-eventual-delivery` rather than duplicating them: same
+faults, different always-on edges and a different silent-loss site.)
+
+## Code paths (file:line)
+- Wiring: `bin/agent-data-plane/src/cli/run.rs:681-684` — events and service_checks edges; both
+  terminate at `dd_out` (the shared Datadog forwarder).
+- Source dispatch fan-out: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1667-1716`
+  (`dispatch_events`): `extract(is_eventd)` → `buffered_named("events").send_all(...)` then
+  `extract(is_service_check)` → `buffered_named("service_checks").send_all(...)`. `send_all` awaits
+  on a full bounded mpsc (backpressure), but a dispatch **error** is only `error!`-logged and
+  swallowed (`mod.rs:1688`, `mod.rs:1702`) — no drop counter on this path.
+- Events encoder silent-drop branch: `lib/saluki-components/src/encoders/datadog/events/mod.rs:179-194`
+  — on a **recoverable** encode error the event is dropped and only `events_dropped_encoder()`
+  incremented (`telemetry.rs:50,77`); TODO admits the dropped count is hardcoded `1`, not the real
+  number (`events/mod.rs:186`). Flush build error is `error!`-logged and the request discarded
+  (`events/mod.rs:208`).
+- Service-checks encoder twin: `lib/saluki-components/src/encoders/datadog/service_checks/mod.rs:177-211`
+  — identical recoverable-drop + flush-discard structure.
+- Wrong-type silent swallow: encoder `process_event` calls `event.try_into_eventd()` /
+  `try_into_service_check()` and returns `ProcessResult::Continue` (consuming + dropping) when the
+  type does not match (`events/mod.rs:173-177`, `service_checks/mod.rs:171-175`;
+  `data_model/event/mod.rs:167-182`). A mis-routed or mistyped event is lost here with NO counter —
+  ties to `source-dispatch-no-misroute`.
+- Zero-payload-size config trap: `events/mod.rs:64-67` documents that `serializer_max_payload_size: 0`
+  makes **every** non-empty compressed payload exceed the limit and be dropped during flush (a silent
+  total-loss config). Same clamp logic applies to service checks.
+- Egress (shared with metrics): the `dd_out` forwarder retry/circuit-breaker/queue-drop behavior is
+  already characterized in `forwarder-eventual-delivery` / `retry-queue-bounded-under-outage`.
+
+## Failure scenario
+Under a slow/throttled or transiently-down intake, the encoder→forwarder edge fills and backpressure
+should propagate up the events/SC edges to the source read loop (queue-and-await, never drop). Two
+silent-loss risks specific to these paths: (1) the encoder's recoverable-error branch drops
+events/SC with an undercounted (`+1`) telemetry signal; (2) a wrong-type event reaching the encoder
+is swallowed with no counter at all. After a transient intake outage clears, every accepted event/SC
+that did not legitimately overflow the (shared) retry queue should still be delivered. A regression
+that turns a backpressure-await into a drop, or mis-scopes the recoverable-error branch, silently
+loses customer events/service-checks — the "won't lose customer data" half of the headline, on a
+path no existing property watches.
+
+## Observations
+- Events/SC have **no aggregation stage**, so unlike metrics there is no flush-window semantics —
+  every accepted event/SC should map ~1:1 to a delivered intake item (modulo batching of up to
+  `MAX_EVENTS_PER_PAYLOAD = 100`, `events/mod.rs:35`). This makes a count-in == count-out reconcile
+  cleaner than for metrics.
+- The `events_received` / `service_checks_received` source counters share the metric name
+  `component_events_received_total` distinguished only by a `message_type` tag
+  (`sources/dogstatsd/metrics.rs:111-119`) — the workload checker must filter by tag, not name.
+- `events_sent` (`telemetry.rs:41,83`) on the encoders is the delivery-side anchor.
+
+## Suggested assertions (MISSING / net-new)
+- Safety: `Always(no silent drop on a wired events/SC edge under load)` — modeled like
+  `no-silent-interconnect-drop` but asserted on the events + service_checks edges; backpressure
+  (await), never discard, on a connected output.
+- Liveness: `Sometimes(all-accepted-events-delivered-after-recovery)` and
+  `Sometimes(all-accepted-service-checks-delivered-after-recovery)` — post-recovery delivered count
+  (`events_sent`, filtered) ≥ accepted count (`events_received` by `message_type`), minus legitimate
+  retry-queue overflow. Liveness ⇒ progress, not an instantaneous invariant.
+- Reachability anchors (REQUIRED to prevent vacuity, esp. for a metrics-heavy workload):
+  `Sometimes(events_received{message_type=events} > 0)` and
+  `Sometimes(service_checks_received > 0)`.
+- Optional Safety guard: `Always(events_dropped_encoder delta == 0)` while intake is healthy and
+  config is non-pathological — catches the recoverable-error drop firing when it shouldn't.
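+
+A rough sketch of the liveness comparison, assuming the checker has scraped counters into flat
+samples. `Sample`, `accepted_for`, and `delivered_covers_accepted` are hypothetical helpers, and the
+overflow figure is assumed to come from whatever signal the workload uses for legitimate retry-queue
+drops.
+
+```rust
+/// Hypothetical flattened view of one scraped counter sample.
+struct Sample<'a> {
+    name: &'a str,
+    message_type: &'a str,
+    value: u64,
+}
+
+/// Sums the accepted count for one stream. The metric name is shared across
+/// streams, so the `message_type` tag is what selects events vs service checks.
+fn accepted_for(samples: &[Sample<'_>], message_type: &str) -> u64 {
+    samples
+        .iter()
+        .filter(|s| s.name == "component_events_received_total")
+        .filter(|s| s.message_type == message_type)
+        .map(|s| s.value)
+        .sum()
+}
+
+/// Post-recovery check: delivered must cover accepted minus legitimate overflow.
+fn delivered_covers_accepted(accepted: u64, delivered: u64, overflow_dropped: u64) -> bool {
+    delivered >= accepted.saturating_sub(overflow_dropped)
+}
+```
+
+The comparison is only meaningful once recovery has settled, so a checker would evaluate it after
+intake is restored and the queues have drained.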
+
+## Config dependencies
+- DSD enabled; events/service_checks on by default (`mod.rs:205-221`).
+- Keep `serializer_max_payload_size` / `serializer_max_uncompressed_payload_size` at non-pathological
+  values for the "no-loss" branch; a separate negative case can set `serializer_max_payload_size: 0`
+  to confirm the documented total-drop trap (`events/mod.rs:64-67`).
+- Shared forwarder/retry-queue config (disk persistence, queue byte caps) governs the
+  eventual-delivery branch exactly as for `forwarder-eventual-delivery`.
+
+## SUT-side instrumentation needs
+- Workload-side: drive an event/SC stream, throttle/down the mock intake, then restore; reconcile
+  accepted (`component_events_received_total{message_type in (events, service_checks)}`) vs delivered
+  (`component_events_sent_total` on the events/SC encoders) at the mock intake, allowing for
+  retry-queue overflow and ~`MAX_*_PER_PAYLOAD` batching slack.
+- The dispatch-error path (`mod.rs:1688,1702`) and the wrong-type swallow have NO counter — a strict
+  no-silent-loss assertion needs net-new SUT-side instrumentation (a drop counter or an
+  `assert_unreachable`) there, else loss on those branches is invisible to a workload checker.
+
+## Open Questions
+- Is the encoder "recoverable error" branch (`events/mod.rs:183`) ever hit on healthy intake with
+  well-formed events, or only on genuinely oversized single events? Determines whether the optional
+  `Always(events_dropped_encoder == 0)` guard is sound or flaky.
+- Does the events/SC retry traffic share `dd_out`'s per-endpoint queue with metrics, so a metric
+  flood can evict queued events (cross-stream eviction)? Affects how the overflow allowance is scoped.
+- Are events/SC requests `Clone` (so retryable failures take the `Error::Open` re-enqueue path), as
+  was confirmed for the metrics forwarder requests in `forwarder-eventual-delivery`? Needs checking
+  for the `/api/v1/events_batch` and service-check request builders.
+- Does `dispatch_events` count anything when `send_all` errors, or is dispatch-time loss fully silent
+  (a finding, shared with `source-dispatch-no-misroute`)?
diff --git a/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md b/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md
new file mode 100644
index 00000000000..675576e1eda
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md
@@ -0,0 +1,78 @@
+---
+slug: events-sc-pipeline-reachable
+title: Events and service-check sub-pipelines are actually exercised (anti-vacuity anchor)
+type: Reachability
+priority: Medium
+status: net-new (no SDK assertion exists)
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+---
+
+# events-sc-pipeline-reachable
+
+## Origin
+Coverage gap + anti-vacuity guard. The entire existing 27-property catalog is metrics-only. A
+realistic ADP workload is dominated by metric samples; events and service-checks are comparatively
+rare. Without an explicit reachability anchor, the two new event/SC safety/liveness properties
+(`malformed-event-sc-no-crash`, `events-sc-no-silent-loss`) can pass **vacuously** — the assertions
+never fail simply because no event ever traversed the parse → enrich → encode → deliver chain. This
+property exists to make the event/SC paths' execution a first-class, observable test obligation, per
+the catalog-wide note that `Sometimes(...)` anchors are mandatory to prove a path is reached
+(`property-catalog.md` "Catalog-wide notes").
+
+## Code paths (file:line)
+- Parse: `lib/saluki-io/src/deser/codec/dogstatsd/event.rs:31` /
+  `.../service_check.rs:28` (codecs decode a frame into `ParsedPacket::Event` / `::ServiceCheck`).
+- Source accept + counter: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1502-1517` increments
+  `events_received()` on a successfully handled event; `mod.rs:1519-1537` increments
+  `service_checks_received()`. Counters: `sources/dogstatsd/metrics.rs:34-39,114-119` (both emit
+  `component_events_received_total` with a distinguishing `message_type` tag).
+- Dispatch onto the named outputs: `mod.rs:1679-1704` (`buffered_named("events")` /
+  `buffered_named("service_checks")`).
+- Delivery: events encoder `encoders/datadog/events/mod.rs:197-213` and service-checks encoder
+  `encoders/datadog/service_checks/mod.rs:195-211` dispatch an `HttpPayload`; success increments
+  `events_sent` (`common/datadog/telemetry.rs:83`) → reaches `dd_out` → mock intake
+  `/api/v1/events_batch` (events) and the service-checks intake endpoint.
+
+## Failure scenario
+Not a SUT bug per se — a **test-quality** failure: if this anchor never fires, the event/SC
+properties provide no real assurance. It also surfaces a genuine SUT regression class: a wiring or
+filter change (e.g. `EnablePayloadsConfiguration` defaulting events/SC off, a future filter dropping
+all events, a broken named output) that silently removes the event/SC path entirely would make this
+`Sometimes` go unsatisfied — a real, observable defect on an "always-on production path."
+
+## Observations
+- Defaults make the path live: `EnablePayloadsConfiguration { events: true, service_checks: true }`
+  (`sources/dogstatsd/mod.rs:205-221`); the edges are unconditionally wired in `run.rs:681-684`
+  (not behind a feature flag like `dsd_debug_log_out`).
+- Two milestones are worth separate anchors so a parse-but-don't-deliver regression is visible:
+  (a) **parsed/accepted** at the source, (b) **delivered** at the encoder/intake.
+
+## Suggested assertions (MISSING / net-new)
+- `Sometimes(event_parsed_and_accepted)` — at least once `events_received`
+  (`component_events_received_total{message_type=events}`) advances.
+- `Sometimes(service_check_parsed_and_accepted)` — `service_checks_received` advances.
+- `Sometimes(event_delivered)` / `Sometimes(service_check_delivered)` — the events/SC encoder's
+  `events_sent` advances and a payload reaches the mock intake's events / service-check endpoint.
+- (Strengthen to `Reachable` if the workload guarantees ≥1 well-formed event + SC per run.)
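+
+One way to evaluate the four anchors from counter snapshots is sketched below. The `Counters` struct
+and `unsatisfied_anchors` are hypothetical; the delivered fields are assumed to come from either the
+encoder counter or a mock-intake observation.
+
+```rust
+/// Hypothetical snapshot of the four milestones, taken before and after a run.
+#[derive(Clone, Copy, Default)]
+struct Counters {
+    events_received: u64,
+    service_checks_received: u64,
+    events_delivered: u64,
+    service_checks_delivered: u64,
+}
+
+/// Returns the anchors whose counter did not advance during the run.
+fn unsatisfied_anchors(before: Counters, after: Counters) -> Vec<&'static str> {
+    let checks = [
+        ("event_parsed_and_accepted", before.events_received, after.events_received),
+        (
+            "service_check_parsed_and_accepted",
+            before.service_checks_received,
+            after.service_checks_received,
+        ),
+        ("event_delivered", before.events_delivered, after.events_delivered),
+        (
+            "service_check_delivered",
+            before.service_checks_delivered,
+            after.service_checks_delivered,
+        ),
+    ];
+    checks
+        .iter()
+        .filter(|(_, b, a)| a <= b)
+        .map(|(name, _, _)| *name)
+        .collect()
+}
+```
+
+Keeping the four anchors separate means a parse-but-not-deliver regression shows up as a named
+delivery anchor rather than a single opaque failure.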
+
+## Config dependencies
+- DSD enabled; events/service_checks left at their `true` defaults (`mod.rs:205-221`).
+- Workload MUST emit at least one well-formed event (`_e{...}`) and one well-formed service check
+  (`_sc|...`) so the anchors can fire — this is a workload-construction requirement, not a SUT config.
+
+## SUT-side instrumentation needs
+- Source-side anchors read the existing `component_events_received_total` counter (filter by
+  `message_type` tag) — no new instrumentation strictly required for the "parsed/accepted" milestone.
+- Delivery-side anchors read `component_events_sent_total` on the events/SC encoders and/or observe
+  the mock intake receiving an events/service-check payload — the cleanest signal is a mock-intake
+  observation, which the deployment topology's controllable mock intake already supports.
+
+## Open Questions
+- Should the delivery anchor key on the encoder `events_sent` counter or on the mock intake actually
+  receiving the `/api/v1/events_batch` (and service-check) POST? Intake observation is stronger
+  (proves end-to-end) but depends on the mock intake distinguishing those endpoints.
+- Is one anchor per stream sufficient, or do we want per-(event vs service-check) AND
+  per-(parsed vs delivered) granularity (4 anchors) to localize a parse-but-not-deliver regression?
+  Leaning toward 4 for diagnostic value at negligible cost.
diff --git a/test/antithesis/scratchbook/properties/filter-config-reload-correct.md b/test/antithesis/scratchbook/properties/filter-config-reload-correct.md
new file mode 100644
index 00000000000..9a7d2568788
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/filter-config-reload-correct.md
@@ -0,0 +1,147 @@
+# filter-config-reload-correct
+
+## Origin
+
+The design partner's documented focus (Confluence "Tag Filter RC Relay Stress Test: agent + ADP",
+AMCC space): the Core Agent pushes metric-filter config (`metric_tag_filterlist`, `metric_filterlist`,
+`statsd_metric_blocklist`, …) over the Remote Config → AgentSecure → ADP config stream **at runtime,
+while data is flowing**. Five components rebuild correctness-affecting filtering state live from that
+stream. The existing `config-runtime-update-not-revalidated` treats config purely as a crash/
+incompatibility gate; it never treats a config update as a **data-correctness** event. This property
+fills that gap: a botched live reload produces *stale* or *fully-cleared* filtering applied to live
+customer metrics — wrong tags retained/dropped, or all filtering silently disabled.
+
+## Code paths
+
+### The watcher (shared mechanism)
+
+- `lib/saluki-config/src/dynamic/watcher.rs:36-74` — `FieldUpdateWatcher::changed`:
+  - **Hazard 1 — silent lag drop:** `Err(broadcast::error::RecvError::Lagged(_))` (`:61-67`) only
+    `warn!`s and `continue`s. The broadcast channel has capacity **100** (`lib.rs:363`,
+    `broadcast::channel(100)`). Under a burst of config changes (or a slow consumer task), a receiver
+    that falls >100 behind **silently loses** the intervening `ConfigChangeEvent`s. If the *latest*
+    state was in the dropped span and no further change to that key arrives, the component keeps
+    **stale filters forever** with no recovery — there is no re-read of current config on lag.
+  - **Hazard 2 — partial-deserialize skip:** `:42-57` — `serde_json::from_value::<T>(...).ok()`; if
+    the new value fails to deserialize, it `warn!`s and **skips the update** (loops). A
+    multi-key/multi-entry filter config where one entry is malformed can leave the component on the
+    **previous** config (half-applied across a multi-watcher component — see below).
+- Each watcher is an independent `broadcast::Receiver` (`lib.rs:797-798,821-824`); a component with N
+  watchers has N receivers that can lag/skip **independently** → a partially-updated filter set.
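+
+The lag hazard can be modeled without the real channel. The sketch below is a std-only toy, not
+`tokio::sync::broadcast`: `SketchChannel`, `Change`, and `drain_for_key` are hypothetical, and the
+model only imitates the documented behavior that a lagging receiver is told how many events it
+missed and then resumes from the oldest retained one.
+
+```rust
+use std::collections::VecDeque;
+
+#[derive(Clone)]
+struct Change {
+    key: &'static str,
+    value: u32,
+}
+
+/// Toy bounded channel: keeps only the newest `capacity` events.
+struct SketchChannel {
+    capacity: usize,
+    buffer: VecDeque<Change>,
+    sent: u64,
+}
+
+enum Recv {
+    Event(Change),
+    Lagged(u64),
+    Empty,
+}
+
+impl SketchChannel {
+    fn new(capacity: usize) -> Self {
+        Self { capacity, buffer: VecDeque::new(), sent: 0 }
+    }
+
+    fn send(&mut self, change: Change) {
+        if self.buffer.len() == self.capacity {
+            self.buffer.pop_front(); // the oldest event is overwritten
+        }
+        self.buffer.push_back(change);
+        self.sent += 1;
+    }
+
+    fn recv(&self, cursor: &mut u64) -> Recv {
+        let oldest = self.sent - self.buffer.len() as u64;
+        if *cursor < oldest {
+            let missed = oldest - *cursor;
+            *cursor = oldest; // skip ahead to the oldest retained event
+            return Recv::Lagged(missed);
+        }
+        if *cursor == self.sent {
+            return Recv::Empty;
+        }
+        let change = self.buffer[(*cursor - oldest) as usize].clone();
+        *cursor += 1;
+        Recv::Event(change)
+    }
+}
+
+/// Warn-and-continue watcher: returns the last value it saw for `key`.
+fn drain_for_key(channel: &SketchChannel, cursor: &mut u64, key: &str) -> Option<u32> {
+    let mut latest = None;
+    loop {
+        match channel.recv(cursor) {
+            Recv::Event(change) if change.key == key => latest = Some(change.value),
+            Recv::Event(_) => {}
+            Recv::Lagged(_missed) => continue, // no re-read of current config
+            Recv::Empty => return latest,
+        }
+    }
+}
+```
+
+With a capacity of 4, one update to `metric_tag_filterlist` followed by four updates to another key
+leaves a slow watcher of `metric_tag_filterlist` with nothing: its only event was overwritten and no
+later event for that key arrives.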
+
+### Hazard 3 — key deletion never fires (the silent clear-all, subtler than expected)
+
+- `lib/saluki-config/src/dynamic/diff.rs:12-48` — `diff_recursive` iterates **only `new_dict` keys**.
+  A key present in the *old* config but **absent in the new snapshot produces NO `ConfigChangeEvent`.**
+  So *deleting* `metric_tag_filterlist` from the streamed config does **not** notify the watcher; the
+  component keeps the **old** filters (stale), it does not clear them.
+- The clear-all DOES happen when the key is delivered as an explicit empty/null value:
+  - `tag_filterlist/mod.rs:274-276` — on a `changed()` event,
+    `compile_filters(new_entries.as_deref().unwrap_or(&[]))`. If `new_value` deserializes to `None`
+    (e.g. explicit `null`) or to an empty `Some([])` (including a shape whose elements fail to parse
+    but that still yields `Some([])`), this rebuilds with
+    **`&[]` → ALL tag filtering removed**, and **rebuilds the context cache** (`build_context_cache()`).
+  - `dogstatsd_prefix_filter/mod.rs:311-334` and `dogstatsd_post_aggregate_filter/mod.rs:290-313` use
+    `if let Some(new) = maybe_new { … }` — they *ignore* a `None`, so an explicit-empty arriving as
+    `None` is a **no-op** there, but an explicit empty `[]` (deserializes to `Some(vec![])`) clears
+    the list.
+- Net effect: **deletion = stale (Hazard 3a)**, **explicit-empty = cleared (Hazard 3b)**, and the two
+  filter families (`tag_filterlist` vs `prefix_filter`/`post_agg_filter`) react **differently** to a
+  `None`. This inconsistency across the five components is a correctness hazard in itself.
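+
+Both halves of Hazard 3 can be illustrated with a small model. This is a sketch under assumptions:
+`Snapshot`, `changed_keys`, `tag_filterlist_reload`, and `prefix_filter_reload` are hypothetical
+names, the config is flattened to one level, and filters are plain string lists rather than
+compiled matchers.
+
+```rust
+use std::collections::BTreeMap;
+
+/// Flattened config snapshot: key -> optional list of filter entries.
+type Snapshot = BTreeMap<String, Option<Vec<String>>>;
+
+/// Additive diff: walks only the keys of `new`, so a key that exists in
+/// `old` but is absent from `new` never appears in the result.
+fn changed_keys(old: &Snapshot, new: &Snapshot) -> Vec<String> {
+    new.iter()
+        .filter(|(key, value)| old.get(*key) != Some(*value))
+        .map(|(key, _)| key.clone())
+        .collect()
+}
+
+/// Tag-filterlist style reload: `None` falls through to an empty list.
+fn tag_filterlist_reload(update: Option<Vec<String>>) -> Vec<String> {
+    update.as_deref().unwrap_or(&[]).to_vec()
+}
+
+/// Prefix-filter style reload: `None` is ignored, `Some(vec![])` clears.
+fn prefix_filter_reload(current: Vec<String>, update: Option<Vec<String>>) -> Vec<String> {
+    if let Some(new) = update {
+        new
+    } else {
+        current
+    }
+}
+```
+
+Deleting a key yields no changed key at all, while delivering it as an empty list does; and the same
+`None` update clears one filter family but leaves the other untouched.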
+
+### The five live-reloading components
+
+1. `bin/agent-data-plane/src/components/tag_filterlist/mod.rs:222,274-277` — 1 watcher
+   (`metric_tag_filterlist`); rebuilds `self.filters` + `self.context_cache` live.
+2. `bin/agent-data-plane/src/components/dogstatsd_prefix_filter/mod.rs:285-289,311-334` — **4 watchers**
+   (`metric_filterlist`, `metric_filterlist_match_prefix`, `statsd_metric_blocklist`,
+   `statsd_metric_blocklist_match_prefix`); each rebuilds the effective blocklist matcher.
+3. `bin/agent-data-plane/src/components/dogstatsd_post_aggregate_filter/mod.rs:268-273,290-313` — **4
+   watchers**; rebuilds the histogram-suffix matcher.
+4. & 5. Any other `watch_for_updates` consumers on correctness-affecting keys (grep
+   `watch_for_updates` — the prefix/post-agg filters share the same four keys, so a single key change
+   fans out to multiple components that must stay mutually consistent).
+
+## Failure scenario
+
+While ADP forwards live metrics, the Core Agent pushes a rapid sequence of filter-config updates
+(the RC relay stress test). One of:
+
+- **Lag:** the burst exceeds the 100-slot broadcast buffer; a filter component's receiver lags, the
+  `Lagged` arm drops the events, and the component keeps applying **stale** filters (e.g. still
+  excluding a tag the operator just re-included, or still forwarding a metric just added to the
+  blocklist) with no self-correction.
+- **Partial:** one malformed entry in a multi-entry `metric_tag_filterlist` update fails
+  deserialization → the whole update is skipped → stale filtering; or, in `prefix_filter`, one of the
+  four keys updates while another's event is dropped → a **half-applied** filter config (new
+  blocklist, old match-prefix flag) that filters inconsistently.
+- **Clear-all:** an explicit empty `[]` (or a `None`-deserializing value) for `metric_tag_filterlist`
+  rebuilds with `&[]` → **all tag filtering silently disabled** on live data (tags the customer
+  intended to drop now flow to intake); deleting the key entirely instead leaves filtering **stale**
+  (the opposite surprise).
+
+All are silent (warn-only at most) and customer-visible (wrong tags / wrong metrics forwarded).
+
+## Property
+
+- **Type:** Safety (data-correctness under live config reload).
+- **Invariant:**
+  - `Always(after a filter-config update is acknowledged-applied, the next metric for an affected
+    name is filtered per the NEW config)` — i.e. no stale filtering after a settled update. Assert
+    SUT-side at the filter apply site by comparing the metric's post-filter tags/keep-decision to the
+    currently-loaded `CompiledFilters`/matcher.
+  - `Unreachable("filter update Lagged-dropped with no subsequent reconciliation")` on the
+    `RecvError::Lagged` arm (`watcher.rs:61`) — or, if lag is accepted as best-effort, `Reachable`
+    there + a liveness check that the component eventually converges to the latest config (it does
+    not, today — there is no re-read).
+  - `Sometimes(filter config reloaded while metrics in flight)` to prove the reload-under-load state
+    is reached (non-vacuity).
+  - `AlwaysOrUnreachable(tag filtering not silently fully-cleared by a config event that the operator
+    did not intend as a clear)` — distinguishes deletion (should-stay-or-explicitly-clear) from
+    explicit-empty.
+- **Antithesis angle:** the core interaction is **burst + scheduling**: push many filter updates
+  faster than the filter task drains its broadcast receiver (node throttling / CPU modulation on
+  `adp` widens the lag window), interleaved with sustained metric load so the stale/half-applied/
+  cleared window overlaps live data. Also explore (a) deletion vs explicit-empty vs malformed-entry
+  shapes, and (b) updating one of the four prefix-filter keys while starving another receiver.
+- **Priority:** High (this is the design partner's explicit stress scenario; correctness under live
+  RC reload is the headline of the AMCC Confluence page).
+
+## Config dependencies
+
+- Dynamic config must be **enabled** (remote-agent mode, Add-on 1 topology with a Core Agent / config
+  stub). In standalone mode `watch_for_updates` returns a watcher whose `rx` is `None` →
+  `changed()` pends forever (`watcher.rs:30-33`) and none of this fires. **This property cannot run in
+  standalone mode.**
+- Keys to drive: `metric_tag_filterlist`, `metric_filterlist`, `metric_filterlist_match_prefix`,
+  `statsd_metric_blocklist`, `statsd_metric_blocklist_match_prefix` (constants in
+  `dogstatsd_filterlist.rs`).
+- The Core Agent/stub must be able to send **malformed**, **explicit-empty**, **key-deleting**
+  (snapshot omitting the key), and **bursty** sequences the real Agent might not — favor a stub.
+
+## Open Questions
+
+- Is the broadcast `Lagged` drop considered acceptable (best-effort) by the team, or a bug? There is
+  **no re-read of current config** on lag, so a dropped *final* update is permanent staleness.
+  `(needs human input)`
+- Deletion-doesn't-diff (Hazard 3a): is the intended Agent semantics that removing a filter key from
+  RC should *clear* the filter? If so, ADP's additive `diff_recursive` (`diff.rs:12-48`) silently
+  diverges (keeps stale filters). Confirm against Agent RC semantics.
+- The `tag_filterlist` vs `prefix_filter`/`post_agg_filter` asymmetry on a `None` update
+  (clear-vs-ignore): intended or accidental? It means the same RC action has different effects on
+  different filters.
+- Cross-component consistency: a single `metric_filterlist` change fans out to both `prefix_filter`
+  and `post_agg_filter` via separate receivers — can they diverge transiently (one applied, one
+  lagged) and does that produce an observable wrong-filter window?
+- `tag_filterlist` only filters **Counter + sketch** metrics (`tag_filterlist/mod.rs:235-237`); does
+  the Agent's equivalent filter the same metric-type subset? (See `tag-filterlist-applied-consistently`.)
+
+### Investigation Log
+
+- Examined: `watcher.rs` (full), `diff.rs` (full), `lib.rs:363,541-651,797-824` (broadcast cap 100,
+  dynamic updater, subscribe/watch), and the `select!` reload arms in all three filter components.
+- Found: (1) `Lagged` is warn-and-continue with no re-read → permanent staleness on a dropped final
+  update; (2) partial-deserialize is warn-and-skip → stale/half-applied; (3) `diff_recursive` is
+  additive so key **deletion** emits no event (stale), while explicit-empty/`None` clears for
+  `tag_filterlist` but is ignored by the prefix/post-agg filters. Three distinct hazards, all
+  silent, all on live customer data.
+- Confirmed: in standalone mode the watcher never fires (`rx: None`), so this needs Add-on 1.
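+
+To make finding (3) concrete, here is a minimal sketch of why an additive diff emits nothing on
+key deletion. It is not the code in `diff.rs`: `additive_diff` and the flat string map are
+hypothetical stand-ins, assuming only that the diff walks the *new* snapshot.
+
+```rust
+use std::collections::BTreeMap;
+
+/// Hypothetical additive diff (NOT the real `diff_recursive`): emits `(key, new_value)`
+/// for keys that were added or changed. It only walks `new`, so a key that exists in
+/// `old` but is missing from `new` produces no event at all.
+fn additive_diff(
+    old: &BTreeMap<String, String>,
+    new: &BTreeMap<String, String>,
+) -> Vec<(String, String)> {
+    new.iter()
+        .filter(|(k, v)| old.get(*k) != Some(*v))
+        .map(|(k, v)| (k.clone(), v.clone()))
+        .collect()
+}
+```
+
+Under this model a snapshot that omits `metric_filterlist` yields an empty event list (the filter
+stays stale), while a snapshot that sets the key to an empty value yields one event.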
diff --git a/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md b/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md
new file mode 100644
index 00000000000..98864f58b74
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md
@@ -0,0 +1,165 @@
+---
+slug: forwarder-eventual-delivery
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Liveness
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: After a transient intake outage clears, accepted-and-retryable transactions are eventually delivered
+
+## Origin
+SUT analysis §5 liveness #5 ("After a transient intake outage clears, queued data is
+eventually delivered") and §2 egress description. Headline guarantee's *no silent loss*
+half, in the egress/forwarder path. No Antithesis assertion exists.
+
+## What the code does
+
+### Retry model = circuit breaker + re-enqueue (not inline retry)
+`lib/saluki-components/src/common/datadog/io.rs`:
+- In-flight completion handler (~451-482). On a circuit-breaker-open result
+  `Err(RetryCircuitBreakerError::Open(req))` (~468-474): the request is reassembled into a
+  `Transaction` and **re-enqueued to the low-priority queue** via `pending_txns.push_low_priority`.
+  Only if *that re-enqueue itself errors* does it log "Events may be permanently lost." On
+  success it tracks queue drops (overflow eviction) telemetry.
+- `process_http_response` (~541-563): success → `track_successful_transaction`. Non-success →
+  `track_permanently_failed_transaction` (these are statuses the classifier did NOT mark
+  retryable — see below — so they are permanent drops by design).
+- `Err(RetryCircuitBreakerError::Service(e))` (~460-463): an error the **retry policy declined to
+  retry** → `track_permanently_failed_transaction` (permanent drop). Per the breaker logic below,
+  this branch is reached only for *non-retryable* outcomes; retryable transport errors do NOT land
+  here.
+
+### Circuit breaker mechanics — what becomes Open vs Service
+`lib/saluki-io/src/net/util/middleware/retry_circuit_breaker.rs` `ResponseFuture::poll` (~95-128):
+after the inner request completes, it calls `state.policy.retry(&mut req, &mut result)`.
+- `Some(backoff)` (policy says retry) → arms the shared backoff and returns `Err(Error::Open(req))`
+  carrying the original request (~101-112). This is what the io.rs handler re-enqueues.
+- `None` (policy declines) or request-not-cloneable → `Err(Error::Service)` (~113-121), the permanent
+  branch in io.rs.
+The policy wraps `StandardHttpClassifier` whose `should_retry(Err(_)) == true`
+(`classifier/http.rs:78-83`) and `StandardHttpRetryLifecycle` which explicitly categorizes
+DNS/connection/TLS transport errors (`lifecycle/http.rs:76-100`). **Therefore connection resets,
+timeouts surfacing as transport errors, and 5xx are routed to `Error::Open` → re-enqueued, NOT
+dropped via `Service`.** The earlier worry that connection resets are permanently dropped is
+resolved: they are retryable and re-enqueued, provided the request was cloneable.
+
+### Retry classification — which failures are retryable
+`lib/saluki-io/src/net/util/retry/classifier/http.rs`:
+- `default_should_retry` (~12-26): **400 / 401 / 403 / 413 → NOT retried** (treated as permanent
+  client misconfig/bug). All other 4xx and all 5xx → retried. `should_retry(Err(_))` (~81) →
+  transport errors retried.
+- So: 5xx storms, timeouts (408/504), 429, and transport errors are retryable → must be
+  eventually delivered after the outage clears. 400/401/403/413 are a permanent drop by design
+  (out of scope for the liveness claim).
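+
+The status split above can be sketched as a single match. This is an illustration, not the body of
+`default_should_retry`; the name `is_retryable_status` is hypothetical, and treating non-error
+statuses as "nothing to retry" is an assumption of the sketch.
+
+```rust
+/// Hypothetical stand-in for the status classification described above.
+/// Not the actual `default_should_retry` from `classifier/http.rs`.
+fn is_retryable_status(status: u16) -> bool {
+    match status {
+        // Treated as permanent client misconfiguration or bugs: never retried.
+        400 | 401 | 403 | 413 => false,
+        // Every other 4xx (e.g. 408, 429) and all 5xx: retried.
+        400..=599 => true,
+        // Assumption: non-error statuses are not failures, so nothing to retry.
+        _ => false,
+    }
+}
+```
+
+A workload can use the same predicate to decide which injected failures belong in the
+"accepted-and-retryable" set that the liveness claim covers.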
+
+## Failure scenario (Antithesis)
+Accept a known set of transactions, then inject a transient intake outage: 5xx storm +
+timeouts + connection resets for a bounded window, then restore healthy 2xx. Liveness
+expectation: every transaction that was (a) accepted and (b) retryable is eventually delivered
+(observed at the mock intake), with no permanent loss — assuming the retry queue did not overflow
+(see Open Questions / overflow tension).
+
+## Key observations
+- This is a **liveness** property: the bad outcome is "never delivered." It needs an eventual
+  window after fault clearance; assert progress, not an instantaneous invariant.
+- The re-enqueue is to the **low-priority** queue, which is also the overflow target; under a
+  long outage the queue can overflow and drop *oldest* (SUT §2 two-tier `PendingTransactions`,
+  bias to freshest). So the clean liveness claim holds only for outages short enough that the
+  retry queue does not overflow. Beyond that, eventual delivery is intentionally sacrificed for
+  bounded memory (the §5-liveness-#4 tension).
+- `track_permanently_failed_transaction` and `track_queue_drops` telemetry are the observable
+  loss signals; `track_successful_transaction` is the delivery signal.
+
+## Config deps
+- `forwarder_retry_queue_max_size_bytes` (`queue_max_size_bytes`) — overflow threshold; sets how
+  long an outage can last before eventual delivery is no longer guaranteed.
+- Circuit breaker backoff schedule (exponential + jitter) — sets recovery latency, hence the size
+  of the "eventually" window the assertion must allow.
+
+## Suggested assertion (MISSING — net-new)
+- **Sometimes(all-accepted-retryable-delivered-after-recovery)**: at least once, after a transient
+  outage clears and within a bounded window, the count of delivered transactions equals the count
+  of accepted-and-retryable transactions submitted before/during the outage (queue did not overflow).
+  This proves recovery actually happens. Best evaluated workload-side by reconciling the controlled
+  input set against the mock-intake received set.
+- Supporting **Reachable**: the `Error::Open` re-enqueue path (`io.rs:468-474`) is hit at least once
+  (proves the circuit breaker engaged and re-enqueued, not silently dropped).
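+
+A minimal sketch of the workload-side reconciliation, assuming the workload tags each submitted
+transaction with an ID and the mock intake records the IDs it receives. `Submitted` and
+`undelivered` are hypothetical names; overflow exclusion is not modeled here.
+
+```rust
+use std::collections::HashSet;
+
+/// Hypothetical workload-side record of one submitted transaction.
+struct Submitted {
+    id: u64,
+    /// True if the injected failure for this transaction was a retryable one.
+    retryable: bool,
+}
+
+/// Returns the sorted IDs of accepted-and-retryable transactions that the mock
+/// intake never received. An empty result means the `Sometimes` condition held.
+fn undelivered(submitted: &[Submitted], received: &HashSet<u64>) -> Vec<u64> {
+    let mut missing: Vec<u64> = submitted
+        .iter()
+        .filter(|t| t.retryable && !received.contains(&t.id))
+        .map(|t| t.id)
+        .collect();
+    missing.sort_unstable();
+    missing
+}
+```
+
+The check should run only after the recovery window has elapsed, since an in-flight retry is not
+yet a loss.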
+
+## SUT-side instrumentation needs
+- An SDK `assert_reachable` at the re-enqueue site (`io.rs:470`) to confirm the breaker re-enqueues.
+- Primary check is workload-side reconciliation against the mock intake (needs a deterministic,
+  countable input set and a mock intake that records received transaction IDs).
+
+## Open questions
+- **Retry-queue overflow bound under the test's outage length** — must size the outage shorter than
+  overflow, or the assertion must explicitly exclude overflowed (oldest-dropped) transactions. The
+  overflow drop (`track_queue_drops`) is the legitimate bounded-memory escape valve, so eventual
+  delivery is only guaranteed for outages that don't overflow `queue_max_size_bytes`.
+
+### Investigation Log
+
+#### (a) Are forwarder requests always cloneable? (b) Is breaker backoff per-endpoint? (2026-05-28)
+
+**Examined:**
+- `lib/saluki-io/src/net/util/middleware/retry_circuit_breaker.rs` in full — `ResponseFuture::poll`
+  (:81-130), `RetryCircuitBreaker::call`/`poll_ready` (:218-258), `State`/`new` (:132-205),
+  `Layer for RetryCircuitBreakerLayer` (:164-173).
+- `lib/saluki-io/src/net/util/retry/policy/rolling_exponential.rs:95-141` (`Policy` impl, incl.
+  `clone_request`).
+- `lib/saluki-components/src/common/datadog/io.rs:351-410` (`run_endpoint_io_loop` service build incl.
+  the breaker layer at :385-388) and `:236-278` (per-endpoint task spawn loop).
+- `lib/saluki-components/src/common/datadog/transaction.rs:55-243` (`TransactionBody<B>` and
+  `Transaction<B>`, incl. `#[derive(Clone)]` at :58 and :203).
+- `lib/saluki-components/src/forwarders/datadog/mod.rs:83-132` (concrete forwarder instantiation).
+- `lib/saluki-common/src/buf/chunked.rs:102-103` (`FrozenChunkedBytesBuffer`).
+
+**Found — (a) requests ARE always cloneable on the production path (re-enqueue, not permanent drop):**
+- The breaker layer sits at io.rs:385-388, applied to `Request<TransactionBody<B>>` (the body→
+  `ClientBody` conversion is the *inner* `map_request` at :388, AFTER the breaker, by explicit design
+  per the comment at io.rs:374-376 — so the request the breaker clones/holds is
+  `Request<TransactionBody<B>>`).
+- The non-cloneable → `Error::Service` permanent-drop path (retry_circuit_breaker.rs:118-121) is
+  reached only when `state.policy.clone_request(&req)` returns `None` (:248 → `req: None` → `take()`
+  is `None` at :100,:118). The production policy is `RollingExponentialBackoffRetryPolicy`, whose
+  `clone_request` is `Some(req.clone())` unconditionally (rolling_exponential.rs:138-140) and which
+  bounds `Req: Clone` (:99). It NEVER returns `None`. (The only `None`-returning impls are
+  `NoopRetryPolicy` at policy/mod.rs:19 and the test-only `NonCloneableTestRetryPolicy` — neither is
+  on the forwarder path.)
+- The concrete `B` in production is `FrozenChunkedBytesBuffer`
+  (`TransactionForwarder<FrozenChunkedBytesBuffer>`, forwarders/datadog/mod.rs:132), which is
+  `#[derive(Clone)]` (chunked.rs:102-103). `TransactionBody<B>` is `#[derive(Clone)]` (transaction.rs:58)
+  and `Request<T>: Clone` when `T: Clone`. So `clone_request` always succeeds.
+- Therefore every retryable outcome routes to `Error::Open(req)` (retry_circuit_breaker.rs:101-112) →
+  re-enqueued to the low-priority queue at io.rs:468-474. The "non-cloneable → silent permanent loss"
+  worry is **NOT realizable** on the production forwarder. RESOLVED.
+
+**Found — (b) circuit-breaker backoff is PER-ENDPOINT (no cross-endpoint serialization):**
+- The backoff lives in `State { policy, backoff }` behind `Arc<Mutex<State<P>>>`, created fresh in
+  `RetryCircuitBreaker::new` (retry_circuit_breaker.rs:200-205). That constructor runs once per
+  `Layer::layer` call (:170-172).
+- `run_endpoint_io_loop` builds its own `service` with its own `.layer(RetryCircuitBreakerLayer::new(...))`
+  (io.rs:385-388). Crucially, each endpoint gets its **own** `run_endpoint_io_loop` task: the spawn
+  loop at io.rs:253-278 iterates `resolved_endpoints` and calls `spawn_traced_named(...,
+  run_endpoint_io_loop(...))` once per endpoint, each with its own `endpoint_rx` channel,
+  `pending_txns`/`RetryQueue`, and breaker `State`.
+- The shared upstream `service` is `.clone()`d into each task (io.rs:273), but the breaker `State`
+  `Arc<Mutex<...>>` is constructed *inside* each task's `ServiceBuilder` (io.rs:385), so each endpoint
+  has an independent `backoff`. The policy is `.clone()`d per layer (:171) but state is not shared.
+  One endpoint's open breaker (its `state.backoff = Some(...)`, retry_circuit_breaker.rs:106-108,
+  gating `poll_ready` at :222-229) cannot stall another endpoint's recovery. RESOLVED.
+
+**Not found:** No global/static breaker state, no shared backoff future across endpoints, and no
+production code path that supplies a non-cloneable request or a `None`-returning `clone_request` to
+the forwarder breaker.
+
+**Conclusion:** Both sub-questions resolved favorably. (a) Production transactions are always
+cloneable (`FrozenChunkedBytesBuffer` → `TransactionBody` → `Request`, all `Clone`; policy always
+clones), so retryable failures take the `Error::Open` re-enqueue path — no silent permanent drop via
+`Error::Service` for retryable errors. (b) Each endpoint task owns an independent circuit breaker and
+backoff, so multi-endpoint fan-out recovers per-endpoint; one slow endpoint does not serialize
+others. The "eventually" window in the liveness assertion is correctly per-endpoint. The remaining
+real caveat is unchanged: eventual delivery holds only for outages short enough that the low-priority
+retry queue does not overflow `queue_max_size_bytes` (drop-oldest).
diff --git a/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md b/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md
new file mode 100644
index 00000000000..10a67fa283c
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md
@@ -0,0 +1,131 @@
+---
+slug: graceful-shutdown-within-30s
+title: Graceful shutdown completes within the 30s grace window without forceful kill
+family: Lifecycle Transitions & Configuration
+type: Liveness (bounded-time) + Reachability
+priority: High
+status: assertion-missing
+sut_commit: 042f41db3bd97118c38981765fd49696fce9d318
+---
+
+# graceful-shutdown-within-30s
+
+## Origin
+
+SUT analysis §5 Safety #6 ("Graceful shutdown completes within 30s without forceful kill (in-flight
+data drained)"). This slug owns the **TIMING / clean-completion** angle. The data-loss agent owns
+`shutdown-drains-no-loss` (WHAT data survives — e.g. open-window buckets dropped unless
+`flush_open_windows`, retry-queue disk flush). Keep this property about *completing cleanly in
+time*, not about which data is preserved.
+
+## Files / functions / lines
+
+- `bin/agent-data-plane/src/cli/run.rs:255-287` — the main `select!` that ends the run loop on one
+  of three triggers:
+  - internal supervisor finishes (run.rs:256-279),
+  - `running_topology.wait_for_unexpected_finish()` (run.rs:280-283) → `topology_failed = true`,
+  - `tokio::signal::ctrl_c()` (run.rs:284-286) → SIGINT, logs "Received SIGINT, shutting down…".
+- `bin/agent-data-plane/src/cli/run.rs:289-290`:
+  ```rust
+  let topology_result = running_topology.shutdown_with_timeout(Duration::from_secs(30)).await;
+  ```
+  The **30s grace window** is hard-coded here.
+- `bin/agent-data-plane/src/cli/run.rs:292-294`: after the topology shutdown completes (clean or
+  forced), the internal supervisor is told to shut down (`internal_supervisor_shutdown_tx.send(())`)
+  and awaited.
+- `bin/agent-data-plane/src/cli/run.rs:300-315`: maps `topology_result` to the final process result.
+  `Ok(())` → clean (or "clean despite errors" if `topology_failed`); `Err(e)` → propagated → exit 1.
+- `lib/saluki-core/src/topology/running.rs:71-124` — `shutdown_with_timeout`:
+  - `:72-78` sets `shutdown_deadline = now + timeout`, arms a `sleep(timeout)`, and a 5s
+    `progress_interval` for "still waiting on component(s)" logs.
+  - `:82` `self.shutdown_coordinator.shutdown()` triggers source shutdown, cascading downstream.
+  - `:86-117` loop: as each task finishes, `handle_task_result` records clean/unclean; when
+    `join_next_with_id()` returns `None` (all tasks done) → `info!("All components stopped.")` →
+    break with `stopped_cleanly` reflecting whether every component returned Ok.
+  - `:111-115` **forceful stop path**: if `shutdown_timeout` fires first →
+    `warn!("Forcefully stopping topology after shutdown grace period.")`, `stopped_cleanly = false`,
+    break (remaining component tasks are dropped/aborted by the `JoinSet` being dropped).
+  - `:119-123` returns `Ok(())` iff `stopped_cleanly`, else
+    `Err(generic_error!("Topology failed to shutdown cleanly."))`.
+- `lib/saluki-core/src/topology/running.rs:130-162` — `handle_task_result`: a component returning
+  `Ok(())` during shutdown is "stopped" (clean); `Err`/`JoinError` (panic/cancel) → unclean.
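+
+The deadline race can be modeled with plain threads and a channel. This is a simplified sketch,
+not `shutdown_with_timeout` itself: the real code uses a tokio `JoinSet` and `sleep`, and
+`wait_for_shutdown` / `run_shutdown` are hypothetical names. Unlike the real forceful path, the
+sketch does not abort the remaining workers.
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+use std::time::{Duration, Instant};
+
+/// Waits for `workers` completion messages, up to `timeout`.
+/// Returns Ok only if every worker reported Ok before the deadline.
+fn wait_for_shutdown(
+    done_rx: &mpsc::Receiver<Result<(), String>>,
+    workers: usize,
+    timeout: Duration,
+) -> Result<(), String> {
+    let deadline = Instant::now() + timeout;
+    let mut stopped_cleanly = true;
+    let mut remaining = workers;
+    while remaining > 0 {
+        let left = deadline.saturating_duration_since(Instant::now());
+        match done_rx.recv_timeout(left) {
+            Ok(Ok(())) => remaining -= 1,
+            // A component finished with an error: unclean, but keep waiting.
+            Ok(Err(_)) => {
+                stopped_cleanly = false;
+                remaining -= 1;
+            }
+            // Deadline hit (or all senders gone): the forceful-stop path.
+            Err(_) => {
+                stopped_cleanly = false;
+                break;
+            }
+        }
+    }
+    if stopped_cleanly {
+        Ok(())
+    } else {
+        Err("Topology failed to shutdown cleanly.".to_string())
+    }
+}
+
+/// Spawns one worker per delay and reports whether all finish within `timeout`.
+fn run_shutdown(delays_ms: &[u64], timeout: Duration) -> Result<(), String> {
+    let (tx, rx) = mpsc::channel();
+    for &ms in delays_ms {
+        let tx = tx.clone();
+        thread::spawn(move || {
+            thread::sleep(Duration::from_millis(ms));
+            let _ = tx.send(Ok(()));
+        });
+    }
+    drop(tx);
+    wait_for_shutdown(&rx, delays_ms.len(), timeout)
+}
+```
+
+The two outcomes match the two branches a workload would want marked as reachable: every worker
+done before the deadline, or at least one still running when it fires.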
+
+## Key observation / honest framing
+
+- "Within 30s" is enforced by the `sleep(30s)` race in `shutdown_with_timeout`. The clean path
+  (`stopped_cleanly == true`, run.rs topology_result `Ok`) means **all** component tasks finished
+  and returned Ok before the deadline. The forceful path is reached only if at least one component
+  fails to stop within 30s.
+- This is a **bounded-time liveness** property. Under *bounded* in-flight load (the slug's
+  condition), the expectation is that shutdown completes cleanly within 30s — the forceful-stop
+  warning should be rare/never. Under *unbounded* or adversarial load (e.g. forwarder blocked on a
+  dead intake with a huge retry queue), the forceful path is legitimately reachable, so do not
+  assert it as Always-clean unconditionally; scope the clean-completion assertion to the
+  bounded-load workload.
+- Note the **internal supervisor** shutdown (run.rs:294, `_ = internal_supervisor_handle.await`) has
+  **no timeout** — it awaits indefinitely. The 30s bound applies only to the data topology. So
+  "graceful shutdown within 30s" is precisely a *topology* property; the overall process exit could
+  still hang on the internal supervisor (Open Question).
+
+## Failure scenarios (Antithesis angle)
+
+- **SIGINT under bounded load:** send SIGINT (run.rs:284) while a modest, finite amount of data is
+  in flight. Expect: topology stops cleanly, `info!("All components stopped.")` logged, `Ok` result,
+  exit 0; forceful-stop warning NOT emitted. Sometimes(clean shutdown completed within 30s).
+- **SIGINT with a wedged downstream:** forwarder `dd_out` blocked on a dead/slow mock intake so a
+  component cannot drain within 30s. Expect: forceful stop path (running.rs:111-115) →
+  `Err("Topology failed to shutdown cleanly.")` → exit 1. Reachable("forced topology stop after
+  grace period"). This proves the timeout actually bounds shutdown time (no indefinite hang of the
+  topology).
+- **Unexpected component finish** (run.rs:280) then shutdown — same 30s path.
+- **Timing/interleaving:** Antithesis schedules so a component finishes right at the 30s boundary —
+  exercises the race between `join_next` and `shutdown_timeout`.
+
+## Config dependencies
+
+- 30s is hard-coded (run.rs:290) — not configurable. The internal supervisor children use their own
+  `ShutdownStrategy::Graceful(Duration::from_secs(5))` (supervisor.rs:125-126), distinct from the
+  topology's 30s.
+- `flush_open_windows` / aggregate flush behavior (SUT §3) affects how long the aggregate component
+  takes to finish on shutdown — interacts with timing but is owned (for data preservation) by the
+  data-loss agent.
+- Forwarder retry-queue disk flush on shutdown (SUT §2) can extend shutdown time / block draining.
+- Memory mode / backpressure (SUT §4) affects how much is in flight at shutdown.
+
+## Assertion (MISSING — net-new instrumentation)
+
+No Antithesis SDK assertions exist. Proposed SUT-side:
+- In `shutdown_with_timeout`, on the clean break (running.rs:90-93, "All components stopped"):
+  `assert_reachable!`/`Sometimes("topology shutdown completed cleanly")` and optionally record
+  elapsed since shutdown start to assert `<= 30s` (it is structurally bounded, but the assertion
+  documents intent).
+- On the forceful-stop branch (running.rs:111-115): `assert_reachable!("topology forcefully stopped
+  after grace period")` so the workload can confirm the timeout path is *reachable* under adversarial
+  load, and — under the **bounded-load** workload only — a workload-side
+  `AlwaysOrUnreachable`/Unreachable expectation that this branch is not taken.
+- Workload-side: on SIGINT under bounded load, assert process exits 0 within ~35s (30s grace + slack)
+  and that the "Forcefully stopping topology" warning is absent.
+
+## Open questions
+
+- **Does the overall process honor 30s, or can it still hang?** run.rs:294
+  `internal_supervisor_handle.await` has no timeout. If an internal-supervisor child hangs on
+  shutdown, the *process* exit can exceed 30s even though the *topology* shut down in time. WHY IT
+  MATTERS: a workload asserting "process exits within ~35s" might fail for a reason outside this
+  property's scope. WHAT CHANGES: either scope the assertion to topology-shutdown completion
+  (log/assertion inside `shutdown_with_timeout`) rather than process exit, or file a separate
+  property for internal-supervisor shutdown bounding.
+- **What happens to tasks dropped on forceful stop?** On running.rs:115 break, the `JoinSet` is
+  dropped, aborting remaining tasks. Confirm aborted component tasks cannot leave shared state
+  (interner buffer, retry queue) corrupted. WHY IT MATTERS: clean-in-time vs. data-integrity overlap
+  with the data-loss agent's property; keep the boundary clear.
+- **Is `shutdown_coordinator.shutdown()` idempotent / does the cascade reliably reach every component?**
+  A source that never observes the shutdown signal would never finish and force the timeout. Needs a
+  read of `ComponentShutdownCoordinator` to confirm all edges are signaled.
+
+## SUT-side instrumentation needs
+
+- Antithesis SDK dependency (none today).
+- Reachable markers on both the clean-break and forceful-stop branches of `shutdown_with_timeout`
+  (running.rs:90-93 and 111-115).
+- Optional elapsed-time capture at shutdown start vs. completion to assert the 30s bound explicitly.
diff --git a/test/antithesis/scratchbook/properties/interner-full-bounded.md b/test/antithesis/scratchbook/properties/interner-full-bounded.md
new file mode 100644
index 00000000000..777aaeccd34
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/interner-full-bounded.md
@@ -0,0 +1,129 @@
+# interner-full-bounded
+
+**Family:** Resource Boundaries — memory / exhaustion
+**Status:** Verified against code at commit 042f41db3b. Two-mode property:
+- `allow_heap_allocs = false`: bounded + deterministic drop — **expected to HOLD**.
+- `allow_heap_allocs = true` (the DEFAULT): memory no longer bounded — **expected to FAIL** the
+  bounded-memory reading.
+
+## What led to the property
+
+Interner determinism is the foundation of the context memory bound (`sut-analysis.md` §3). The
+fixed-size interner has a hard byte capacity; the question is what happens when it fills. The
+behavior pivots entirely on one config flag, and the **default flips it to the unbounded
+branch** — making this the single highest-leverage config knob in the bounded-memory story.
+
+## Behavior in code
+
+Resolution path `ContextResolver::intern` (`lib/saluki-context/src/resolver.rs:339-353`):
+```rust
+s.try_cheap_clone()                                  // inlineable/cheap strings escape entirely
+ .or_else(|| self.interner.try_intern(s.as_ref())..) // fixed buffer; None when full
+ .or_else(|| self.allow_heap_allocations.then(|| {   // HEAP FALLBACK
+     self.telemetry.intern_fallback_total().increment(1);
+     MetaString::from(s.as_ref())                     // unbounded heap alloc
+ }))
+```
+
+- **Interner-full is deterministic.** `FixedSizeInterner` shard `try_intern`
+  (`lib/stringtheory/src/interning/fixed_size.rs:462-494`) returns `None` when neither a reclaimed
+  entry nor remaining buffer capacity can fit the string (`required_cap <= self.available()` else
+  `None`, lines 489-493). Also `None` if the string exceeds the packed length/capacity max
+  (lines 465-467). No allocation, no panic — just `None`.
+- **Heap-disallowed => metric dropped.** When `allow_heap_allocations == false`, the final
+  `or_else` yields `None`, so `intern` returns `None`, `create_context` returns `None`
+  (`resolver.rs:373` `let context_name = self.intern(name)?;`), `resolve` returns `None`, and
+  `handle_metric_packet` returns `None` (`sources/dogstatsd/mod.rs:1565-1580`): "We failed to
+  resolve the context, likely due to not having enough interner capacity." The metric is dropped
+  deterministically; no memory grows. This is exactly what the unit test
+  `no_metrics_when_interner_full_allocations_disallowed`
+  (`sources/dogstatsd/mod.rs:1808-1834`) asserts (using a noop/zero-size interner + a name longer
+  than the 31-byte inline limit so it can be neither inlined nor interned).
+- **Heap-allowed (DEFAULT) => unbounded.** `allow_heap_allocations` builder default is `true`
+  (`resolver.rs:258` `unwrap_or(true)`, doc `:179-190` calls it "effectively unlimited"), and DSD
+  config `dogstatsd_allow_context_heap_allocs` defaults `true`
+  (`sources/dogstatsd/mod.rs:149-151, 402-406`; wired at `sources/dogstatsd/resolver.rs:38,56,64`
+  for primary, no_agg, and tags resolvers, all sharing one interner at `resolver.rs:40`). On a
+  full interner every over-cap context spills to heap, bumping `intern_fallback_total`, and RSS
+  grows without bound.
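+
+The four possible outcomes can be summarized in a toy model. This is a sketch under simplifying
+assumptions, not the real resolver: `ToyResolver` and `Outcome` are hypothetical, byte accounting
+is just the string length, and deduplication and reclamation are not modeled.
+
+```rust
+/// Toy model of the resolution outcomes; names and byte accounting are illustrative.
+#[derive(Debug, PartialEq)]
+enum Outcome {
+    Inlined,
+    Interned,
+    HeapFallback,
+    Dropped,
+}
+
+struct ToyResolver {
+    capacity: usize,
+    used: usize,
+    allow_heap_allocations: bool,
+    intern_fallback_total: u64,
+}
+
+impl ToyResolver {
+    /// Strings at or below this length are inlined and never touch the interner.
+    const INLINE_MAX: usize = 31;
+
+    fn intern(&mut self, s: &str) -> Outcome {
+        if s.len() <= Self::INLINE_MAX {
+            return Outcome::Inlined;
+        }
+        // Fixed buffer: succeeds only while remaining capacity fits the string.
+        if s.len() <= self.capacity - self.used {
+            self.used += s.len();
+            return Outcome::Interned;
+        }
+        // Interner full: spill to heap if allowed, otherwise drop.
+        if self.allow_heap_allocations {
+            self.intern_fallback_total += 1;
+            return Outcome::HeapFallback;
+        }
+        Outcome::Dropped
+    }
+}
+```
+
+Only the `Dropped` and `HeapFallback` arms differ between the two modes, which is why the
+assertions below are split on the flag.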
+
+## Failure scenario (Antithesis)
+
+Set a small `dogstatsd_string_interner_size_bytes` and flood high-cardinality contexts so the
+interner fills.
+- Mode A (`dogstatsd_allow_context_heap_allocs: false`): assert that once the interner is full,
+  metrics with un-internable names/tags are dropped and **no heap fallback occurs** — memory
+  bounded. Antithesis timing exploration probes the interner reclamation/tombstoning path
+  (loom-tested per existing-assertions.md, raw-pointer `&'static str` keys) under concurrent
+  intern-vs-drop, where the worst documented case is a duplicate entry (more pressure), never
+  corruption.
+- Mode B (default `true`): assert `intern_fallback_total` climbs and RSS escapes the interner
+  budget — the bounded-memory guarantee is void. This is the more important finding because it is
+  the **default** posture.
+
+## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist)
+
+- Heap-disallowed branch: `AlwaysOrUnreachable(interner_full ⇒ metric dropped, no heap alloc)`.
+  AlwaysOrUnreachable because "interner full" is a rare/optional path that may not occur in every
+  run; when it does occur the drop-not-allocate behavior must hold.
+- `Sometimes(try_intern returned None)` / `Sometimes(interner reported full)` — proves the
+  workload actually exhausts the interner; otherwise the above is vacuous.
+- Heap-allowed branch: `Sometimes(intern_fallback_total > 0)` — proves the unbounded spill path is
+  reachable under default config (the finding). Pair with the RSS check from
+  `rss-bounded-under-cardinality`.
+- A counter on `intern_fallback_total` already exists (`resolver.rs:349`) — a natural anchor for
+  the `Sometimes`, but it is telemetry, not an assertion, so the SDK assertion is still net-new.
+
+**SUT-side instrumentation strongly preferred:** distinguishing "interned" vs "inlined" vs
+"heap-fallback" vs "dropped" requires reading internal resolver state. A workload-only checker
+sees dropped metrics (missing at intake) but cannot tell a heap-fallback (memory bug) from a
+clean intern (correct), nor an interner-full drop from a parse drop.
+
+## Configuration dependencies
+
+- `dogstatsd_allow_context_heap_allocs` (default **true** — the unbounded branch).
+- `dogstatsd_string_interner_size_bytes` / `dogstatsd_string_interner_size` (interner capacity;
+  effective default 2 MiB, `sources/dogstatsd/mod.rs:1888-1890`). The primary, no_agg, and tags
+  resolvers share one interner of this size (`sources/dogstatsd/resolver.rs:40`), so all of them
+  draw from the same buffer.
+- Inlining: strings <= 31 bytes are inlined by `MetaString` (`try_cheap_clone`) and bypass the
+  interner entirely — the workload must use long names/tags to actually pressure the interner.
+
+## Open questions
+
+- The fixed-size interner reclaims/tombstones entries when interned strings are dropped. Under
+  steady high-cardinality churn, does reclamation keep pace, or does fragmentation make
+  `try_intern` return `None` even below nominal byte capacity? Why it matters: if fragmentation causes
+  premature "full," the heap-disallowed mode drops metrics earlier than the configured budget
+  implies (more data loss), and the heap-allowed mode spills sooner.
+- The tags resolver also has `with_heap_allocations` (`sources/dogstatsd/resolver.rs:45`) and
+  shares the interner. Is the bound the sum across name+tag interning of both resolvers? Affects
+  the byte budget the assertion measures against.
+
+## Investigation Log
+
+#### Default of `dogstatsd_allow_context_heap_allocs` and whether bounded mode is ever shipped
+- **Examined**: `lib/saluki-components/src/sources/dogstatsd/mod.rs:149-151`
+  (`default_allow_context_heap_allocations`), `:403-406` (serde field + rename), `:438`;
+  `sources/dogstatsd/resolver.rs:38,45,56,64` (resolver wiring); `lib/saluki-context/src/resolver.rs`
+  `with_heap_allocations` (187-188) and the default fallback `.unwrap_or(true)` (258, 663);
+  `config_registry/datadog/dogstatsd.rs:8,392`; grep of all `with_heap_allocations(false)` in non-test code;
+  searched shipped configs (`dist/`, `config/`, all `*.yaml`/`*.toml`) for `heap_alloc`.
+- **Found (a) — default**: `const fn default_allow_context_heap_allocations() -> bool { true }`
+  (`mod.rs:149-151`), applied via `#[serde(rename = "dogstatsd_allow_context_heap_allocs",
+  default = "default_allow_context_heap_allocations")]` (`mod.rs:403-406`). The resolver builder
+  default also independently falls back to `true` (`resolver.rs:258` `unwrap_or(true)`). So default
+  is **true = unbounded heap-allocation (spill) mode**.
+- **Found (b) — bounded mode is test-only**: The only call sites of `with_heap_allocations(false)`
+  are inside `#[cfg(test)] mod tests` in `sources/dogstatsd/mod.rs` (lines 1820 and 1841, module
+  begins `#[cfg(test)]` at 1736 — tests `no_metrics_when_interner_full_allocations_disallowed`
+  and `metric_with_additional_tags`). Production wiring (`resolver.rs:38,45,56,64`) passes
+  `config.allow_context_heap_allocations` straight through. No shipped YAML/TOML config sets it to
+  false. There is no default/code path that forces bounded mode; it is **opt-in via config only**.
+- **Not found**: Any release/default config or non-test code path that sets
+  `dogstatsd_allow_context_heap_allocs = false`.
+- **Conclusion**: RESOLVED. Default is **true** (unbounded spill). Bounded mode (heap-disallowed,
+  interner-is-a-hard-cap) is **opt-in / test-only** — never shipped by default. The
+  realistic default-config property is "interner spills to heap (memory unbounded by the interner)";
+  the hard-bounded property only holds when an operator explicitly sets the flag false. Property
+  framing should remain explicitly two-mode and label the bounded branch as opt-in.
diff --git a/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md b/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md
new file mode 100644
index 00000000000..9bdb9ef4b32
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md
@@ -0,0 +1,128 @@
+---
+slug: interner-reclamation-no-corruption
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Concurrent intern + drop-last-ref never yields overlapping/corrupt entries
+
+## Origin
+SUT analysis §3 ("manual reclamation/tombstoning"), §4 ("Interner reclamation is loom-tested;
+worst case is a duplicate entry, never corruption … the most concurrency-unsafe component in the
+bounded-memory story"). existing-assertions.md notes a `loom` cfg already marks this path as
+concurrency-sensitive. No Antithesis assertion exists.
+
+## What the code does
+`lib/stringtheory/src/interning/fixed_size.rs` + `map.rs` — a fixed byte buffer with manual
+refcount-based reclamation. The safety argument rests on a mutex + a refcount **re-check**:
+
+- `EntryHeader.refs: AtomicUsize` (fixed_size.rs:92-95). `increment_active_refs` uses `AcqRel`
+  (212-214); `decrement_active_refs` (219-221) returns true iff `fetch_sub(1, AcqRel) == 1`
+  (i.e. count hit zero). `is_active` loads `Acquire` (207-209).
+- Drop (map.rs:73-84): when `decrement_active_refs()` says "now zero," takes
+  `interner.lock()` and calls `mark_for_reclamation(self.header)`.
+- **The re-check under the lock** (map.rs:447-470 `InternerState::mark_for_reclamation`): re-reads
+  `header.is_active()` (459). If a concurrent `try_intern` resurrected the entry (incremented refs)
+  between the drop's decrement and acquiring the lock, `is_active()` is now true and reclamation is
+  **skipped**. Only if still inactive does it `entries.remove(entry_str)` (466) then
+  `storage.mark_for_reclamation` (468). Comment (450-454) states only the mutex-mediated `InternerState`
+  increments refs and only the handle drop decrements, so a zero count under the lock means nobody else
+  holds or is acquiring a reference.
+- **The buffer is overwritten on reclaim**: `write_to_reclaimed_entry`/the fill at map.rs:382-393 fills
+  the reclaimed string capacity with `0x21` ("a known repeating value … signal that offsets/reclaimed
+  entries are incorrect and overlapping"). So a stale `&'static str` pointing into a reclaimed slot would
+  read `0x21` bytes, the corruption sentinel (`0xAA` in the `fixed_size.rs` path; see Investigation Log).
+- `try_intern` (map.rs:472-517): under the lock, first checks `entries.get(s)` and on hit
+  `increment_active_refs` (483-484) — this is the resurrection that the drop re-check defends against.
+  On miss, reuses a reclaimed entry (495-497) or unoccupied space (498-499), inserts a `'static`-lifetime
+  key (513-514, with a SAFETY note that the lifetime never outlives the entry).
+- **Loom test** `concurrent_drop_and_intern` (fixed_size.rs:1072-1142): models T1 holding an entry, T2
+  interning the same string, then `drop(t1)` — asserts ≤1 reclaimed entry and that the reclaimed entry
+  does **not overlap** the surviving interned string (`do_reclaimed_entries_overlap`, 1078-1086). The
+  documented acceptable outcome is a benign duplicate (1090-1094, 1117-1121).
+
+## Failure scenario (Antithesis)
+High-cardinality DSD load with many short-lived contexts so the same tag/name strings are repeatedly
+interned and dropped across the source's context resolvers, driving concurrent `try_intern` (resolve)
+against drop-last-ref reclamation on a near-full interner (forcing reclaimed-slot reuse + buffer
+overwrite). The hazard: a `try_intern` that returns a handle into a slot a concurrent drop then reclaims
+and overwrites with `0x21`, producing an interned `&str` whose bytes are corrupt or overlap another live
+entry.
+
+## Key observations
+- The loom test **bounds** the interleavings (loom explores a model with a small, fixed thread/op set);
+  Antithesis explores the real scheduler under real load with many shards and entries — the SUT analysis
+  explicitly flags this as where Antithesis adds value beyond loom.
+- The invariant to check is exactly the loom assertion generalized: **no reclaimed entry overlaps a live
+  entry**, and **no live `InternedString` reads the `0x21` corruption sentinel**. The `0x21` fill is a
+  ready-made detector: a resolved string containing the fill pattern where it shouldn't is corruption.
+- Sharding (`[Arc<Mutex<…>>; SHARD_FACTOR]`, SUT §3) means cross-shard interactions add interleavings loom
+  doesn't model per-run.
+
+## Config deps
+- Interner capacity / `allow_heap_allocations` (SUT §3): with heap fallback on (default true), a full
+  interner spills to heap and the reclaimed-slot-reuse path is less pressured — to exercise reclamation,
+  the test wants a **small** interner and/or heap-fallback off so the buffer actually fills and reclaims.
+- `SHARD_FACTOR` and per-shard capacity govern how often reclaimed slots are reused.
+
+## Suggested assertion (MISSING — net-new)
+- **Always**(no corrupt/overlapping entry): generalize the loom check to a runtime invariant. Realize as
+  an SDK `assert_always` (or `assert_unreachable` on the corruption-detected branch) inside
+  `mark_for_reclamation`/`try_intern` that verifies a newly returned entry does not overlap any reclaimed
+  entry and that resolved bytes are not a run of that interner's fill sentinel (`0x21` in `map.rs`, `0xAA`
+  in `fixed_size.rs`). This needs SUT-side instrumentation: the race is invisible to a workload-only checker.
+- **Sometimes(reclaimed-slot reused)** and **Sometimes(drop re-check found resurrected entry)**: prove the
+  contended reclamation path was actually hit (the `is_active()` re-check at map.rs:459 returning true),
+  not just steady-state interning.
+
+## SUT-side instrumentation needs
+- A debug-build check that scans for overlap between `reclaimed` entries and the live `entries` map, or a
+  per-resolve check that the returned `&str` contains no unexpected `0x21` run, gated behind a test cfg.
+  A workload cannot see interner internals; only SUT-side assertions can catch the corruption branch.
+
+## Open questions
+- **Memory ordering sufficiency:** the re-check relies on `AcqRel`/`Acquire` pairing between
+  `increment_active_refs` and the drop's `decrement` + `is_active` under the lock. The increment at
+  map.rs:484 runs under the `InternerState` lock; verify that the drop's re-check takes the same mutex,
+  so that acquiring it synchronizes with any `try_intern` increment made before the drop got the lock.
+  If both paths hold that mutex, the only race window lies between the atomic decrement (no lock) and
+  acquiring the lock, which is exactly what the re-check covers.
+- **Cross-shard handles:** can an `InternedString` from shard A ever be dropped against shard B's lock?
+  If sharding is by string hash and stable, no — but confirm, since a wrong-shard reclaim would be
+  corruption the per-shard check wouldn't catch.
+
+### Investigation Log
+
+#### Is the reclamation buffer-fill sentinel present in RELEASE builds, or debug-only?
+- **Examined:** `lib/stringtheory/src/interning/map.rs:368-394` (`clear_reclaimed_entry`),
+  `lib/stringtheory/src/interning/fixed_size.rs:435-460` (the analogous reclaim path), and grepped the
+  whole `interning/` dir for `0x21` / `fill` / `debug_assert` / `cfg!` / `debug_assertions`.
+- **Found:**
+  - **Exact fill site (map.rs):** `map.rs:392` — `str_buf.fill(0x21);` inside the `unsafe` block at
+    `map.rs:388-393`, within `fn clear_reclaimed_entry` (`map.rs:368`). It fills the entire string
+    capacity of the tombstoned entry (`str_ptr = entry_ptr.add(1).cast::<u8>()`, length `str_cap`).
+  - **No cfg gate of any kind.** There is no `#[cfg(debug_assertions)]`, no `if cfg!(debug_assertions)`,
+    no `#[cfg(test)]`, no `#[cfg(loom)]` around the fill or around `clear_reclaimed_entry`. The fill is
+    unconditional and therefore **present in release builds**. The only cfg-gated constructs in these
+    files are `debug_assert!` macros (fixed_size.rs:278/286/325/390, map.rs:408) which are unrelated to
+    the fill.
+  - **Important discrepancy — two different sentinels in two different interner implementations:**
+    - `map.rs:392` (the `InternerState`/`Map`-backed interner) fills with **`0x21`** (ASCII `!`).
+    - `fixed_size.rs:458` (the `FixedSizeInterner` reclaim path, `fn` at fixed_size.rs ~430) fills with
+      **`0xAA`** — *not* `0x21`. Same surrounding comment ("Write a magic value … signal that
+      offsets/reclaimed entries are incorrect and overlapping"), same unconditional `unsafe { … fill() }`,
+      but a different byte value. Both are unconditional / release-present.
+- **Not found:** No conditional compilation, feature flag, or runtime toggle disabling either fill. No
+  third fill site.
+- **Conclusion:** RESOLVED. The buffer-fill-on-reclamation is unconditional and present in release builds,
+  so an Antithesis assertion *can* rely on a fill sentinel to detect a stale read into a reclaimed slot
+  rather than being forced to compute overlap directly. **However, the sentinel value is
+  implementation-dependent:** `0x21` for the `map.rs` interner, `0xAA` for the `fixed_size.rs` interner. An assertion
+  that hard-codes `0x21` would miss corruption in the `FixedSizeInterner` path. The robust check is either
+  (a) match against the correct sentinel per implementation, or (b) check overlap directly (the
+  implementation-independent invariant the loom test already uses). Detecting a *run* of either sentinel
+  in a resolved `&str` is a valid corruption signal in the corresponding interner.
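A minimal sketch, placed outside the hunk above, of the two checks the conclusion names: a per-implementation sentinel-run detector and a direct overlap test. The fill bytes come from the Investigation Log; `InternerKind`, `has_sentinel_run`, `ranges_overlap`, and the `min_run` threshold are hypothetical names chosen for illustration and do not exist in the SUT.

```rust
/// Hypothetical tag for the two interner implementations named in the log.
#[derive(Clone, Copy)]
enum InternerKind {
    Map,       // `map.rs` path: fill byte 0x21 per the log
    FixedSize, // `fixed_size.rs` path: fill byte 0xAA per the log
}

impl InternerKind {
    fn fill_byte(self) -> u8 {
        match self {
            InternerKind::Map => 0x21,
            InternerKind::FixedSize => 0xAA,
        }
    }
}

/// Returns true when `bytes` contains at least `min_run` consecutive fill bytes.
/// `min_run` is an assumed tuning knob: 0x21 is ASCII `!`, so one byte alone is
/// not evidence of a stale read. A `min_run` of zero never matches.
fn has_sentinel_run(bytes: &[u8], kind: InternerKind, min_run: usize) -> bool {
    let fill = kind.fill_byte();
    let mut run = 0usize;
    for &b in bytes {
        run = if b == fill { run + 1 } else { 0 };
        if min_run > 0 && run >= min_run {
            return true;
        }
    }
    false
}

/// Implementation-independent check: do the half-open byte ranges
/// [a_start, a_start + a_len) and [b_start, b_start + b_len) overlap?
fn ranges_overlap(a_start: usize, a_len: usize, b_start: usize, b_len: usize) -> bool {
    a_len > 0 && b_len > 0 && a_start < b_start + b_len && b_start < a_start + a_len
}
```

In an instrumented build, either predicate could feed an always-style assertion; the overlap form avoids false positives from legitimate strings that happen to contain `!` runs.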
diff --git a/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md b/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md
new file mode 100644
index 00000000000..557b3780c75
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md
@@ -0,0 +1,98 @@
+---
+slug: malformed-dsd-no-crash
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Malformed DogStatsD packets never crash the process or kill the socket
+
+## Origin
+SUT analysis §2 ("a malformed packet never kills the socket", `mod.rs:1283-1318`), §6 gap
+#6, §8 ("undecided malformed-input error policy in the codecs"). Decomposes the headline
+"ADP will not crash under load." No Antithesis assertion exists (existing-assertions.md).
+
+## What the code does
+`lib/saluki-components/src/sources/dogstatsd/mod.rs::drive_stream` (1146-1337):
+- The `'read` loop reads into an I/O buffer, then a `'frame` loop calls
+  `framer.next_frame(io_buffer, reached_eof)` (1237) and `handle_frame(...)` (1252).
+- **Per-frame parse error** (`handle_frame` returns `Err`, 1266-1270): logged at `warn!`
+  and the loop continues — the bad frame is skipped, not fatal.
+- **Framing error** (1272-1293): increments `framing_errors`; for **connectionless** streams
+  (UDP, UDS datagram) `io_buffer.clear()` + `continue 'read` (1283-1288) — clear-and-continue,
+  the socket survives. For **connection-oriented** streams it `break 'read` (1289-1291), closing
+  only that one connection (not the process).
+- **I/O error** (1306-1318): connectionless → `continue 'read`; connection-oriented → `break 'read`.
+- `handle_frame` (1458-1542) routes to `codec.decode_packet`; on decode error it bumps a
+  type-specific `*_decode_failed` counter and returns `Err` (1462-1473) — caught by the caller above.
+
+Codecs (`lib/saluki-io/src/deser/codec/dogstatsd/`):
+- `metric.rs::parse_dogstatsd_metric` (67-194): `nom` parsers returning `IResult`; unknown
+  `|`-chunks are silently skipped (136-141, with a TODO "throw an error, warn, or be silently
+  permissive?"). `permissive_metric_name` (197-206) uses `from_utf8_unchecked` but only after
+  `take_while1(valid_char)` constrains bytes to printable ASCII (SAFETY comment). `raw_metric_values`
+  validates UTF-8 with `simdutf8` before any `from_utf8_unchecked` (232-234).
+- `event.rs:146-148` and `service_check.rs:94-96`: identical "skip unknown chunk" TODO — **undecided
+  error policy**, currently silently permissive.
+- `metric.rs:243` `unreachable!("should be constrained by alt parser")` — reachable only if the
+  `alt((tag("g"),tag("c"),…))` matched something not in the match arm; provably constrained, so not
+  a real panic site, but it *is* an `unreachable!` on the hot parse path worth covering.
+- Existing proptest `property_test_malicious_input_non_exhaustive` (metric.rs:761-772): 1000 random
+  byte vectors, asserts no panic. This is a `cargo test` proptest, **not** an Antithesis assertion,
+  and is non-exhaustive by its own comment.
+
+## Failure scenario (Antithesis)
+Drive each listener (UDP 8125, TCP, UDS datagram, UDS stream) with adversarial packets:
+oversized frames (exceed buffer → framing error), invalid UTF-8 in value/name positions,
+truncated extensions (`|@`, `|#`, `c:`, `e:`, `card:` with missing bodies), enormous tag lists,
+embedded NULs, partial multi-value (`x:1:2:`), zero-length frames. Expectation: process stays
+up; connectionless sockets keep serving subsequent valid packets; a TCP connection may close but
+the listener accepts new connections; no panic.
+
+## Key observations
+- The clear-and-continue (1283-1288) and skip-bad-frame (1266-1270) paths are the explicit
+  socket-survival mechanism — the property is precise: **connectionless sockets never die on a bad
+  packet; the process never panics on codec/framing errors.**
+- The codecs return `Err` rather than panicking for malformed input by construction (nom + guarded
+  `unsafe`), but the `unreachable!`/`from_utf8_unchecked` sites mean a *parser regression* could turn
+  malformed input into a panic — exactly what Antithesis should guard.
+- TCP `break 'read` closes one connection; that is acceptable (connection-oriented semantics) and must
+  be excluded from a "socket never dies" claim — scope the listener-survival half to connectionless.
+
+## Config deps
+- `permissive` mode (metric.rs:73) widens accepted metric names — broadens the malformed surface;
+  test both permissive and strict.
+- `client_origin_detection`, `timestamps` gates (metric.rs:117/122/127/132) change which extension
+  chunks are parsed; toggling them changes the reachable parse branches.
+- Which listeners are enabled is config-driven; the property should hold for every enabled listener.
+
+## Suggested assertion (MISSING — net-new)
+- **Always**(process up): the ADP process stays alive across the entire malformed-input workload —
+  realized as a workload-side liveness/health check plus a panic hook converting any codec/framing
+  panic into a recorded failure.
+- **Unreachable** at codec panic: `assert_unreachable` covering the `unreachable!` (metric.rs:243) and
+  the `from_utf8_unchecked` SAFETY invariants — any panic there is a must-never.
+- **Always**(connectionless socket survives): after a malformed packet on UDP/UDS-datagram, the same
+  socket successfully receives a subsequent valid packet (`packet_receive_success` increments again).
+- **Sometimes(`framing_errors` > 0)** and **Sometimes(`*_decode_failed` > 0)**: prove the adversarial
+  input actually reached the error paths, i.e. that the workload was not too benign.
+
+## SUT-side instrumentation needs
+- A panic during parse is invisible to a workload-only checker until the process dies; pair a panic
+  hook / `assert_unreachable` at the codec sites with workload-side liveness. `framing_errors` and
+  `*_decode_failed` counters already exist for the `Sometimes` reachability anchors.
+
+## Open questions
+- **What should the codec error policy be for unknown/trailing chunks** (the TODOs cited above)? Currently
+  silently permissive. If the policy changes to "error," more inputs become `Err` (still no crash) but
+  more packets are dropped — affects the data-loss family, not this no-crash property. Worth resolving
+  before finalizing the assertion so the test's expected-drop accounting is stable.
+- **Does a malformed packet ever cause a *partial* dispatch** that mis-routes remaining events (SUT §7
+  #6, mod.rs dispatch-error swallow)? That is a separate routing-correctness property; confirm it is
+  not conflated with no-crash.
+- **Is there a max frame/datagram size that, when exceeded, the framer handles gracefully on every
+  transport** (vs. only connectionless)? Confirm TCP oversized-frame handling does not wedge the
+  connection (it `break`s, but verify no resource leak per closed connection).
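A minimal sketch, placed outside the hunk above, of the framing-error policy and the workload-side survival check described in the notes. The behavior split (clear-and-continue for connectionless, close for connection-oriented) is taken from the notes; every type and function name here is hypothetical and is not the SUT's API.

```rust
/// Transports named in the failure scenario; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transport {
    Udp,
    UdsDatagram,
    Tcp,
    UdsStream,
}

/// What the read loop does next; names are illustrative, not the SUT's.
#[derive(Debug, PartialEq)]
enum ReadLoopAction {
    /// Clear the I/O buffer and keep reading from the same socket.
    ClearAndContinue,
    /// Stop reading this connection; the listener itself keeps accepting.
    CloseConnection,
}

fn is_connectionless(transport: Transport) -> bool {
    matches!(transport, Transport::Udp | Transport::UdsDatagram)
}

/// Policy for a framing error, as the notes describe it.
fn on_framing_error(transport: Transport) -> ReadLoopAction {
    if is_connectionless(transport) {
        ReadLoopAction::ClearAndContinue
    } else {
        ReadLoopAction::CloseConnection
    }
}

/// Workload-side survival check: a valid packet sent after a malformed one
/// must still advance the receive-success counter on the same socket.
fn socket_survived(success_before_valid: u64, success_after_valid: u64) -> bool {
    success_after_valid > success_before_valid
}
```

The survival check applies only to the connectionless transports; for TCP and UDS stream the equivalent check would be that a fresh connection is accepted after the close.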
diff --git a/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md b/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md
new file mode 100644
index 00000000000..749ccd933da
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md
@@ -0,0 +1,99 @@
+---
+slug: malformed-event-sc-no-crash
+title: Malformed DSD event / service-check payloads never crash process or socket
+type: Safety
+priority: High
+status: net-new (no SDK assertion exists)
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+---
+
+# malformed-event-sc-no-crash
+
+## Origin
+Coverage gap: the existing catalog's untrusted-input property `malformed-dsd-no-crash`
+(`property-catalog.md` Category E) is scoped to the **metric** codec + framing. It does NOT
+exercise the two separate, always-on event / service-check codecs that parse untrusted bytes
+on every DSD listener whenever DogStatsD is enabled (`run.rs:681-684`). These codecs are
+~394 LOC (`event.rs`) and ~312 LOC (`service_check.rs`) of net-new nom parsing with their own
+length-prefix, UTF-8, timestamp, and extension-chunk logic that the metric path never touches.
+
+## Code paths (file:line)
+- Event codec entry: `lib/saluki-io/src/deser/codec/dogstatsd/event.rs:31` `parse_dogstatsd_event`.
+  - Length-prefixed body: `event.rs:36-49` parses `_e{<title_len>,<text_len>}:` then
+    `take(title_len)` / `take(text_len)` — **attacker-controlled lengths**. nom `take` on a short
+    buffer returns `Err`, not a panic (good), but this is the untrusted-length surface to fuzz.
+  - UTF-8 validation + `.replace("\\n","\n")` allocation: `event.rs:51-59` (per-packet heap alloc
+    keyed on attacker length — memory-amplification angle under flood).
+  - Extension chunk loop: `event.rs:79-156` — `all_consuming` sub-parsers for `d:`(timestamp),
+    `h:`,`k:`,`p:`,`s:`,`t:`,`c:`,`e:`,`card:`,`#tags`. `unix_timestamp` parser at
+    `event.rs:88`; cardinality at `event.rs:133-139`. Unknown chunks silently skipped
+    (`event.rs:145-149`, TODO: "throw an error, warn, or be silently permissive?").
+- Service-check codec entry: `lib/saluki-io/src/deser/codec/dogstatsd/service_check.rs:28`
+  `parse_dogstatsd_service_check`.
+  - `_sc|<name>|<status:u8>` via `parse_u8` + `CheckStatus::try_from` (`service_check.rs:31-38`).
+  - Extension loop `service_check.rs:48-104`: `d:`,`h:`,`c:`,`e:`,`#tags`,`m:`(utf8 message),`card:`.
+    Same silent-skip TODO (`service_check.rs:93-97`).
+- Decode dispatch + error counting: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1462-1474`
+  (`handle_frame` → `codec.decode_packet`); decode failure increments
+  `event_decode_failed()` / `service_check_decode_failed()` (`mod.rs:1468-1469`,
+  counters defined `sources/dogstatsd/metrics.rs:58-63`) and returns `Err(ParseError)`.
+- Socket-survival mechanism (shared with metrics): connectionless framing/parse errors are
+  logged + the buffer cleared + loop continues (`sut-analysis.md` §2, `mod.rs:1283-1318`); a bad
+  event/SC frame must not kill the socket or process.
+
+## Failure scenario
+An adversarial event/SC payload triggers a panic or unbounded resource use in the dedicated codec
+(e.g. a parser path the non-exhaustive unit tests miss: pathological length prefixes, invalid UTF-8
+in title/text/message, malformed `d:` timestamp, `card:` parsing, multibyte boundary in
+`.replace`). Because these codecs are entirely separate from the metric codec, the existing
+`malformed-dsd-no-crash` coverage gives no assurance here. A panic on any DSD listener thread is a
+process crash (data-plane components are fail-stop, `sut-analysis.md` §2); a crash-loop under a
+malformed-event flood violates the headline "won't crash under load" guarantee.
+
+## Observations
+- Both codecs return `nom::Err` on bad input rather than panicking in the paths read; no `unwrap`/
+  `expect`/`unsafe` was seen in `event.rs` or `service_check.rs` themselves. The risk is (a) shared
+  helper parsers (`helpers::*` — `unix_timestamp`, `tags`, `cardinality`, `ascii_alphanum_and_seps`,
+  `local_data`, `external_data`) and (b) the `.replace` allocations under flood. Helpers were not
+  fully read — see Open Questions.
+- `title_len == 0 || text_len == 0` is rejected (`event.rs:44-46`), but huge declared lengths just
+  fail the `take` — confirm no pre-allocation on the declared length.
+- Error policy for unknown trailing chunks is undecided (silently permissive) in BOTH codecs — same
+  open policy question as the metric path; affects expected-drop accounting, not no-crash.
+
+## Suggested assertions (MISSING / net-new — no Antithesis SDK in tree per `existing-assertions.md`)
+- `Always(process_up)` and `Always(connectionless socket survives a bad event/SC packet)` — extends
+  the metric-only `malformed-dsd-no-crash` to the event/SC frames; can reuse the same process-up /
+  socket-survival workload checker but MUST drive event/SC frames specifically.
+- SUT-side `Unreachable` at any panic site reachable from `parse_dogstatsd_event` /
+  `parse_dogstatsd_service_check` and their helpers (none confirmed yet — guards regressions and the
+  shared-helper risk).
+- `Sometimes(event_decode_failed > 0)` and `Sometimes(service_check_decode_failed > 0)` — reachability
+  anchors proving the malformed-event/SC parse-error paths are actually exercised (avoids vacuity;
+  these counters already exist at `metrics.rs:58-63`).
+
+## Config dependencies
+- DogStatsD enabled (`data_plane.enabled: true`); events/service_checks sub-pipelines are on by
+  default (`EnablePayloadsConfiguration` defaults `events: true`, `service_checks: true`,
+  `sources/dogstatsd/mod.rs:205-221`).
+- `client_origin_detection` gates the `c:`/`e:`/`card:` extension parsers (`event.rs:122-139`,
+  `service_check.rs:68-92`); toggling it changes which parser branches untrusted bytes reach. Drive
+  BOTH settings to cover the gated parsers.
+
+## SUT-side instrumentation needs
+- A process-up / socket-alive workload checker (shared with `malformed-dsd-no-crash`) plus
+  event/SC-shaped malformed frames in the workload generator.
+- The two `Sometimes` anchors read existing decode-failure counters; the `Unreachable` panic guard,
+  if added, needs an SDK assertion compiled into the codec/helpers (net-new dependency).
+
+## Open Questions
+- Do the shared `helpers::*` parsers (`unix_timestamp`, `tags`, `cardinality`, `local_data`,
+  `external_data`, `ascii_alphanum_and_seps`, `utf8`) contain any panic/`unwrap`/pre-allocation on
+  attacker-controlled length? Not yet read — pivotal for whether the `Unreachable` guard is needed.
+- Does `take(title_len)`/`take(text_len)` (or the message `utf8` parser) ever pre-allocate on the
+  declared length before validating the buffer is long enough (a memory-amplification vector under
+  flood)?
+- Is a malformed event/SC frame ever mis-classified by `parse_message_type` (`mod.rs:1466`) such that
+  the wrong decode-failure counter increments? Cosmetic, but it affects the `Sometimes` anchor wiring.
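A minimal sketch, placed outside the hunk above, of the property the open questions ask about: parsing `_e{<title_len>,<text_len>}:` and slicing by the declared lengths without allocating on them. It uses only `std`, not `nom`, and assumes the public DogStatsD layout `<title>|<text>` after the header; `parse_event_header`, `EventParts`, and `EventHeaderError` are hypothetical names, not the codec's.

```rust
#[derive(Debug, PartialEq)]
enum EventHeaderError {
    BadPrefix,
    BadLength,
    ZeroLength,
    MissingSeparator,
    Truncated,
    InvalidUtf8,
}

/// Borrowed view into the input; nothing is allocated from the declared lengths.
#[derive(Debug, PartialEq)]
struct EventParts<'a> {
    title: &'a str,
    text: &'a str,
    rest: &'a [u8],
}

fn parse_event_header(input: &[u8]) -> Result<EventParts<'_>, EventHeaderError> {
    use EventHeaderError::*;

    let body = input.strip_prefix(b"_e{").ok_or(BadPrefix)?;
    let close = body.iter().position(|&b| b == b'}').ok_or(BadLength)?;
    let lens = std::str::from_utf8(&body[..close]).map_err(|_| BadLength)?;
    let (title_len, text_len) = lens.split_once(',').ok_or(BadLength)?;
    let title_len: usize = title_len.parse().map_err(|_| BadLength)?;
    let text_len: usize = text_len.parse().map_err(|_| BadLength)?;
    if title_len == 0 || text_len == 0 {
        return Err(ZeroLength);
    }

    let after = body[close + 1..].strip_prefix(b":").ok_or(BadPrefix)?;
    // `get` bounds-checks against the real buffer, so a huge declared length
    // yields `Truncated` instead of a panic or an allocation.
    let title = after.get(..title_len).ok_or(Truncated)?;
    let after_title = after[title_len..].strip_prefix(b"|").ok_or(MissingSeparator)?;
    let text = after_title.get(..text_len).ok_or(Truncated)?;
    let rest = &after_title[text_len..];

    let title = std::str::from_utf8(title).map_err(|_| InvalidUtf8)?;
    let text = std::str::from_utf8(text).map_err(|_| InvalidUtf8)?;
    Ok(EventParts { title, text, rest })
}
```

Whether the real codec and its helpers share this no-allocation shape is exactly what the open questions leave to be confirmed; the sketch only shows what the desired behavior looks like.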
diff --git a/test/antithesis/scratchbook/properties/mapper-interner-bounded.md b/test/antithesis/scratchbook/properties/mapper-interner-bounded.md
new file mode 100644
index 00000000000..2a1bb6ed4ac
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/mapper-interner-bounded.md
@@ -0,0 +1,104 @@
+# mapper-interner-bounded
+
+## Origin
+
+Coverage-gap analysis. The catalog's `interner-full-bounded` covers the **DSD source's** context
+interner. The `dogstatsd_mapper` carries a **second, independent** string interner (a whole separate
+`ContextResolver` built inside the mapper, default 64 KiB) that interns the *mapped* names and the
+*expanded* tags. When that interner is full, the mapper's `try_map` returns `None` and the metric is
+**silently left un-remapped** — it flows downstream under its *original* (unmapped) name/tags. This is
+a distinct, uncovered silent-failure surface: a second bounded resource with its own
+saturation-drop behavior, on a transform that claims Agent equivalence.
+
+## Code paths
+
+- `lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs`
+  - Interner built at `mod.rs:158-167`:
+    `ContextResolverBuilder::from_name("…/dsd_mapper/primary")…with_interner_capacity_bytes(64 KiB
+    default)…build()`. **It does NOT call `with_heap_allocations(false)`**, so heap fallback defaults
+    to `true` (`resolver.rs:258`). Default size = `default_context_string_interner_size` = `ByteSize::kib(64)`
+    (`mod.rs:34-36`), key `dogstatsd_mapper_string_interner_size` (`mod.rs:51-55`).
+  - Slow-path resolve: `self.context_resolver.resolve_with_origin_tags(new_name.as_str(), merged_tags,
+    origin_tags.clone())?` (`mod.rs:317-321`). The trailing `?` means **when resolution returns
+    `None`, `try_map` returns `None`** → the caller does not replace the context.
+  - Cache-hit path resolves too: `resolve_with_origin_tags(result.name.clone(), merged_tags, …)`
+    (`mod.rs:277-282`) — returns the `None` directly. So even a cached positive result can fail to
+    materialize a context if the interner is full at apply time.
+  - Caller: `DogStatsDMapper::transform_buffer` (`mod.rs:388-398`) — `if let Some(new_context) =
+    try_map(...) { *metric.context_mut() = new_context; }`. **No `else`** → on `None` the metric keeps
+    its original context silently. No drop, no `events_discarded`, no dedicated counter.
+- Resolution `None` semantics: `resolve_inner` → `create_context` returns `None` when name/tag
+  interning fails and heap is disallowed (`resolver.rs:436-483`, name interned at the
+  `try_intern…or_else(allow_heap_allocations.then(...))` site `resolver.rs:346-349`).
+
+## Failure scenario
+
+Two distinct modes:
+
+1. **Default config (heap fallback ON):** under a high-cardinality flood of mappable names, the mapper
+   interner never returns `None` but spills mapped names/tags to the heap — the mapper's declared
+   64 KiB bound is voided and memory grows unbounded (parallels `interner-full-bounded`'s
+   default-config failure, but for a *second* interner, one the firm-bound declaration accounts for
+   at `mod.rs:374-375`).
+2. **Heap fallback OFF (not configurable today; see Config dependencies):** the mapper interner fills; `try_map` returns
+   `None`; the metric is **forwarded under its original, unmapped name/tags**. Downstream filters
+   (`dsd_prefix_filter`, `dsd_tag_filterlist`, `dsd_post_agg_filter`) then make decisions on the wrong
+   name, and the customer sees the pre-mapping identity — a silent correctness divergence from the
+   Agent, not just a dropped metric. Whether a metric is remapped then depends on the cardinality of
+   *mapped* strings seen so far, independent of the source interner.
+
+## Property
+
+- **Type:** Safety. Heap-OFF: the silent-non-remap is a correctness hazard to surface. Heap-ON
+  (default): bounded-memory claim fails by design (mirrors `interner-full-bounded`).
+- **Invariant:**
+  - Heap-OFF: `AlwaysOrUnreachable(mapper interner full ⇒ metric forwarded UNDER ORIGINAL context,
+    accounted)` — i.e. the silent-non-remap must be observable/counted, never a silent partial-map.
+    `Sometimes(mapper resolve == None)` proves exhaustion is reached.
+  - Heap-ON (default): `Sometimes(mapper intern heap fallback > 0)` proves the unbounded spill path is
+    reachable for the *mapper's own* interner.
+  - SUT-side instrumentation required to distinguish mapper-interned / heap-fallback / resolve-None /
+    forwarded-original — none of these has a metric today (the firm-bound accounting at
+    `mod.rs:367-382` is a static declaration, not a runtime counter).
+- **Antithesis angle:** small `dogstatsd_mapper_string_interner_size` + a flood of *distinct mappable*
+  names (each expands to a unique mapped name + tags) fills the mapper interner specifically; combine
+  with the source-interner flood (`interner-full-bounded`) to show the two interners saturate
+  independently. Timing/scheduling exploration stresses the resolver under churn (idle-context
+  expiration is 30s, `mod.rs:166`).
+- **Priority:** High.
+
+## Config dependencies
+
+- `dogstatsd_mapper_string_interner_size` (default 64 KiB) — shrink to force exhaustion cheaply.
+- `dogstatsd_allow_context_heap_allocs` — note this is the **DSD source** key; the mapper interner
+  does **not** read it (it never sets `with_heap_allocations`), so the mapper always defaults to
+  heap-ON unless the resolver default changes. Confirm there is no separate mapper heap flag (there is
+  not in current source). This asymmetry is itself a finding.
+- `dogstatsd_mapper_profiles` must be set (a profile must match) for the mapper interner to be
+  exercised at all.
+- `dogstatsd_mapper_cache_size` (default 1000): a cached positive result still re-resolves
+  (`mod.rs:277-282`), so the interner can fail even on a cache hit.
+
+## Open Questions
+
+- Is the mapper's lack of a `with_heap_allocations(false)` option intentional, or an oversight that
+  makes the mapper interner's declared 64 KiB firm bound (`mod.rs:374-375`) unenforceable under
+  default behavior? `(needs human input)`
+- The cache stores the mapped `name`/`extra_tags` but resolution can still fail at apply time
+  (`mod.rs:277-282`): does that mean a metric can be remapped on one call and silently NOT remapped on
+  the next, for the *same* name, purely due to interner pressure? That would be a non-deterministic,
+  load-dependent identity flip — needs confirmation under churn.
+- Does the Datadog Agent mapper have an analogous bounded interner with the same drop-to-original
+  behavior, or does it always allocate? Determines whether heap-OFF behavior is an ADP-specific
+  divergence (ties to `mapper-output-matches-agent`).
+
+### Investigation Log
+
+- Examined: `dogstatsd_mapper/mod.rs` build (`158-183`), `try_map` (`259-342`), `transform_buffer`
+  (`388-398`), `specify_bounds` (`367-382`); `resolver.rs:251-299,334-349,436-483` for the resolver's
+  default `allow_heap_allocations=true` and the `Option<Context>` `None` path.
+- Found: the mapper instantiates a fully separate `ContextResolver` with its own 64 KiB interner and
+  the default heap-ON behavior; on resolve-`None` the metric is silently forwarded under its original
+  context with no counter. This is a genuine second interner-full surface, distinct from
+  `interner-full-bounded` (DSD source) — different resource, different downstream consequence
+  (silent non-remap vs. dropped context).
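A minimal sketch, placed outside the hunk above, of what "accounted" could mean for the heap-OFF invariant: the `None` branch that is silent today increments a counter instead. The interner is simulated by a byte budget, and `try_map_sketch`, `transform_sketch`, and `MapperStats` are hypothetical stand-ins rather than the mapper's real types.

```rust
/// Hypothetical counters; the notes say none of these exists as a metric today.
#[derive(Default, Debug)]
struct MapperStats {
    remapped: u64,
    forwarded_original: u64,
}

/// Stand-in for `try_map`: returns `None` when the simulated interner has no
/// room for the mapped name (heap fallback assumed off).
fn try_map_sketch(name: &str, interner_free_bytes: &mut usize) -> Option<String> {
    let mapped = format!("mapped.{}", name);
    if mapped.len() > *interner_free_bytes {
        return None;
    }
    *interner_free_bytes -= mapped.len();
    Some(mapped)
}

/// Stand-in for the transform loop, with an explicit branch for the
/// non-remapped case so that it can be counted.
fn transform_sketch(names: &mut [String], free: &mut usize, stats: &mut MapperStats) {
    for name in names.iter_mut() {
        match try_map_sketch(name, free) {
            Some(new_name) => {
                *name = new_name;
                stats.remapped += 1;
            }
            // The metric keeps its original name; count it rather than stay silent.
            None => stats.forwarded_original += 1,
        }
    }
}
```

A `Sometimes(forwarded_original > 0)` anchor over such a counter would show that exhaustion was actually reached in a run.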
diff --git a/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md b/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md
new file mode 100644
index 00000000000..121d13440f5
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md
@@ -0,0 +1,98 @@
+# mapper-output-matches-agent
+
+## Origin
+
+Coverage-gap analysis: the existing 27-property catalog frames ADP as a *transport* and a single
+aggregation *transformer* (`aggregate-matches-agent`), but the DogStatsD transform chain has four
+additional correctness-affecting transforms that all claim Datadog-Agent equivalence, none of which has a
+property. The `dogstatsd_mapper` is the most complex: it rewrites the metric **name** and injects new
+**tags** by expanding regex capture groups, with its own result cache and its own string interner.
+A divergence from the Agent's mapper is silent, customer-visible data corruption (wrong metric name,
+wrong/missing tags) that the happy-path `panoramic` diff suite does not target as a mapper-specific
+case.
+
+## Code paths
+
+- `lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs`
+  - `MetricMapper::try_map` (`mod.rs:259-342`) — slow path iterates profiles, runs each `Regex`,
+    on a match clears `new_name` and calls `captures.expand(&mapping.name, &mut new_name)`
+    (`mod.rs:298-299`), then for each configured tag does
+    `captures.expand(tag_value_expr, &mut expanded_tag_value)` (`mod.rs:302-309`).
+  - Profile selection: `metric_name.starts_with(&profile.prefix) || profile.prefix == "*"`
+    (`mod.rs:292`); **first matching profile + first matching mapping wins** (tests
+    `test_wildcard_prefix_order` `mod.rs:781`, `test_multiple_profiles_order` `mod.rs:821`).
+  - Wildcard→regex compilation: `build_regex` (`mod.rs:186-215`) escapes `.` → `\.`, turns `*` →
+    `([^.]*)`, anchors `^…$`; rejects chars outside `[a-zA-Z0-9\-_*.]` and consecutive `**`.
+  - Existing tags are preserved and merged with expanded tags (`merge_shared`, `mod.rs:314-315`;
+    test `test_retain_existing_tags` `mod.rs:888`).
+- Pipeline placement (`bin/agent-data-plane/src/cli/run.rs:640-641,674-675`): the mapper is the
+  *first* transform in the `dsd_enrich` chained transform, ahead of `dsd_prefix_filter`,
+  `dsd_tag_filterlist`, `dsd_agg`, `dsd_post_agg_filter`. So a mapper rename changes which
+  prefix/blocklist/filterlist rules subsequently apply — a mapper bug cascades into every downstream
+  filter decision.
+- Agent reference: the Datadog Agent mapper
+  (`pkg/dogstatsd/mapper/mapper.go`) is the equivalence target; the wildcard `([^.]*)` translation
+  and `$1`/`${1}` expansion syntax mirror Agent behavior (tests `test_use_regex_expansion_alternative_syntax`,
+  `test_expand_name`).
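+
+A minimal sketch of the wildcard translation described in the list above, assuming only the rules
+stated there. `wildcard_to_regex_pattern` is a hypothetical stand-in, not the real `build_regex`; it
+returns the pattern string instead of a compiled `Regex` so it needs no external crate.
+
+```rust
+/// Hypothetical stand-in for the wildcard translation; not the real `build_regex`.
+fn wildcard_to_regex_pattern(pattern: &str) -> Result<String, String> {
+    // Consecutive wildcards are rejected up front.
+    if pattern.contains("**") {
+        return Err(format!("consecutive '*' in pattern: {pattern}"));
+    }
+    let mut out = String::from("^"); // anchor the start
+    for ch in pattern.chars() {
+        match ch {
+            '.' => out.push_str("\\."),     // literal dot
+            '*' => out.push_str("([^.]*)"), // one capture group per wildcard
+            c if c.is_ascii_alphanumeric() || c == '-' || c == '_' => out.push(c),
+            other => return Err(format!("invalid character {other:?} in pattern")),
+        }
+    }
+    out.push('$'); // anchor the end
+    Ok(out)
+}
+
+fn main() {
+    println!("{:?}", wildcard_to_regex_pattern("airflow.job.*.duration"));
+    println!("{:?}", wildcard_to_regex_pattern("airflow.**"));
+}
+```
+
+Under these rules `airflow.job.*.duration` becomes `^airflow\.job\.([^.]*)\.duration$`, so a wildcard
+never spans a `.` separator. Whether the Agent applies the same character class is still an open
+question for this property.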
+
+## Failure scenario
+
+A mapping profile is configured (statically or pushed at runtime via the config stream). For an input
+metric name, ADP's mapper produces a different `(name, tags)` than the Datadog Agent mapper would —
+e.g. different capture-group expansion for overlapping wildcards, different first-match selection
+across profiles, different handling of a name that matches the wildcard char class but not the
+Agent's, or a tag injected/dropped where the Agent would do the opposite. The metric is then
+aggregated and forwarded under the wrong identity. This is silent: no error, no drop counter; the
+customer sees a metric that does not match the Agent's output for the same workload + mapper config.
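+
+The first-match selection that this scenario probes can be sketched as follows. This is a simplified
+model, not the real `MetricMapper::try_map`: the `Mapping`/`Profile` types are hypothetical, a plain
+predicate stands in for the compiled regex, and the fall-through to the next profile when no mapping
+matches is an assumption.
+
+```rust
+/// Hypothetical, simplified mapping: a plain predicate stands in for the compiled regex.
+struct Mapping {
+    matches: fn(&str) -> bool,
+    new_name: &'static str,
+}
+
+struct Profile {
+    prefix: &'static str,
+    mappings: Vec<Mapping>,
+}
+
+/// Returns the new name from the first mapping that matches, scanning profiles in order.
+fn try_map(profiles: &[Profile], metric_name: &str) -> Option<&'static str> {
+    for profile in profiles {
+        // Selection rule quoted under "Code paths": prefix match, or the "*" wildcard profile.
+        if !(metric_name.starts_with(profile.prefix) || profile.prefix == "*") {
+            continue;
+        }
+        for mapping in &profile.mappings {
+            if (mapping.matches)(metric_name) {
+                return Some(mapping.new_name); // first match wins; later mappings are ignored
+            }
+        }
+        // Assumption: a selected profile with no matching mapping falls through to the next one.
+    }
+    None
+}
+
+fn sample_profiles() -> Vec<Profile> {
+    vec![
+        Profile {
+            prefix: "airflow.",
+            mappings: vec![Mapping {
+                matches: |name: &str| name.ends_with(".duration"),
+                new_name: "airflow.job.duration",
+            }],
+        },
+        Profile {
+            prefix: "*",
+            mappings: vec![Mapping { matches: |_: &str| true, new_name: "catch_all" }],
+        },
+    ]
+}
+
+fn main() {
+    let profiles = sample_profiles();
+    for name in ["airflow.job.etl.duration", "redis.hits"] {
+        println!("{name} -> {:?}", try_map(&profiles, name));
+    }
+}
+```
+
+Swapping the two profiles would send every name to `catch_all`, which is the kind of ordering
+sensitivity a differential corpus with overlapping profiles is meant to expose.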
+
+## Property
+
+- **Type:** Safety (differential).
+- **Invariant:** Harness/diff-side `Always(mapped (name, tags) within ratio of Agent mapper output)`
+  per flush window, anchored on the existing `panoramic`/`stele` diff harness but with a
+  **mapper-exercising corpus** (millstone names crafted to hit the configured profiles) and an
+  identical `dogstatsd_mapper_profiles` config on both the Agent baseline and ADP. Pair with
+  `Sometimes(mapper remapped a metric)` so the diff is not vacuous (the corpus actually triggers
+  remapping). A SUT-side `Sometimes(cache hit returned same result as a fresh miss)` localizes the
+  result-cache correctness facet.
+- **Antithesis angle:** Differential equivalence under (a) overlapping/ambiguous profiles that probe
+  first-match ordering, (b) names at the wildcard char-class boundary, (c) the same config delivered
+  at runtime over the config stream vs. statically, and (d) fault-induced flush-timing skew (the
+  `panoramic` harness alone runs one deterministic order; faults explore reordering). Run as the
+  Add-on 2 diff topology with a `dogstatsd_mapper_profiles` config on both sides.
+- **Priority:** High.
+
+## Config dependencies
+
+- `dogstatsd_mapper_profiles` (JSON array of `{name, prefix, mappings:[{match, match_type, name, tags}]}`)
+  must be set identically on the Agent baseline and ADP.
+- `dogstatsd_mapper_cache_size` (default 1000) — exercise both cache-on and cache-off (`0`) to cover
+  the cache path vs. the slow path returning the same result.
+- `dogstatsd_mapper_string_interner_size` (default 64 KiB) — interacts with
+  `mapper-interner-bounded`; keep generous here so interner exhaustion does not confound the diff.
+
+## Open Questions
+
+- Does the Datadog Agent mapper apply **all** matching mappings within a profile, or only the first?
+  ADP returns on the first matching mapping (`mod.rs:332`). If the Agent differs, this is itself a
+  bug, not just a test-setup detail. `(needs Agent-source confirmation)`
+- Does the Agent restrict wildcard match characters to the same `[a-zA-Z0-9\-_*.]` class
+  (`ALLOWED_WILDCARD_MATCH_PATTERN`, `mod.rs:31-32`)? A name the Agent maps but ADP's class rejects
+  at *config-load* (build error) vs. *match-miss* changes the observable divergence.
+- Is `FLUSH_WAIT`≈32s on both sides enough once faults delay flushes (timing-artifact false diffs)?
+  Same caveat as `aggregate-matches-agent`.
+- Can the config stream actually push `dogstatsd_mapper_profiles` at runtime, and does the mapper
+  rebuild on that key? (The mapper has **no `watch_for_updates`** — see Open Questions of
+  `filter-config-reload-correct`; the mapper appears static-only, unlike the filters.) Determines
+  whether the runtime-config facet of this property is reachable.
+
+### Investigation Log
+
+- Examined: full `dogstatsd_mapper/mod.rs` incl. all unit tests; `run.rs:638-679` pipeline wiring;
+  `resolver.rs:436-483` for the `Option<Context>` resolution semantics.
+- Found: mapper is a `SynchronousTransform` (`mod.rs:388-398`) with first-match-wins selection and
+  capture-group expansion; equivalence to the Agent is claimed via mirrored expansion syntax and the
+  wildcard translation, but there is **no differential test** against the Agent — only self-consistent
+  Rust unit tests.
+- Note: the mapper has no config-stream watcher, so unlike the filters it is configured once at build;
+  the runtime-config facet is likely Unreachable for the mapper specifically (flag for the team).
diff --git a/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md b/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md
new file mode 100644
index 00000000000..a0998f6ffba
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md
@@ -0,0 +1,105 @@
+# memory-limiter-survives-rss-read-failure
+
+**Family:** Resource Boundaries — memory / fault tolerance
+**Status:** Verified against code at commit 042f41db3b. Property is expected to **FAIL by design**:
+an RSS read failure mid-run panics the limiter thread and silently freezes backpressure. Needs
+SUT-side instrumentation to express well.
+
+## What led to the property
+
+`sut-analysis.md` §4 and §7 flag the limiter checker thread's `.expect()` on the RSS read as a
+fail-open hazard: if `/proc` reads start failing mid-run (transient `/proc` unavailability, fault
+injection, namespace/cgroup churn), the dedicated checker thread panics and dies. Memory
+protection — already the only runtime memory mechanism — silently vanishes, and the
+last-written backoff value is **frozen** in the `AtomicU64` forever.
+
+## Behavior in code
+
+`MemoryLimiter::new` smoke-tests RSS availability once at construction
+(`lib/resource-accounting/src/limiter.rs:43-44`: `Querier::default().resident_set_size()?` — if
+unavailable at startup, returns `None` and the caller falls back to a noop limiter via
+`accounting.rs:175-177`). But the **steady-state checker loop** does not tolerate later failures
+(`limiter.rs:99-122`):
+```rust
+loop {
+    let actual_rss = querier
+        .resident_set_size()
+        .expect("memory statistics should be available");   // <-- panics the thread mid-run
+    let maybe_backoff_duration = calculate_backoff(...);
+    match maybe_backoff_duration {
+        Some(d) => active_backoff_nanos.store(d.as_nanos() as u64, Relaxed),
+        None    => active_backoff_nanos.store(0, Relaxed),
+    }
+    std::thread::sleep(Duration::from_millis(250));
+}
+```
+Consequences when `resident_set_size()` returns `None` mid-run:
+1. `.expect()` panics. The thread is a bare `std::thread` (`limiter.rs:54-65`), not a supervised
+   task and not in the data-topology JoinSet — its death does **not** trigger the process-wide
+   shutdown that data-component exits cause. The process keeps running.
+2. The loop stops updating `active_backoff_nanos`. Whatever value was last stored is **frozen**:
+   - If it was 0 (RSS was below threshold when reads failed), backpressure is permanently off —
+     fail-open, no protection. Cooperating tasks (`wait_for_capacity`, `limiter.rs:83-88`) never
+     wait again even as RSS climbs.
+   - If it was a nonzero backoff, that exact backoff is applied forever regardless of actual RSS
+     — including after RSS would have dropped, needlessly throttling the source indefinitely.
+3. No telemetry or log surfaces the thread death; `memory_limiter.current_backoff_secs`
+   (`limiter.rs:111,116`) simply stops updating. Observability goes stale silently.
+
+So the property — "memory protection remains active (or the failure is surfaced) when RSS reads
+fail" — is violated: protection silently freezes and the failure is not surfaced.
+
+## Failure scenario (Antithesis)
+
+Run with the limiter enabled (`memory_mode: permissive|strict` + `memory_limit` set,
+`enable_global_limiter: true`). Use Antithesis fault injection to make RSS reads fail mid-run
+(e.g. interfere with `/proc/self/statm` or the platform stat source the `process_memory::Querier`
+uses) while a load generator pushes RSS toward the limit. Observe that the checker thread dies,
+backoff freezes, and no error is surfaced. The race that makes the freeze damaging — reads fail
+*before* RSS crosses the threshold, leaving backoff at 0 — is exactly the kind of timing-ordering
+Antithesis explores and the deterministic harness cannot.
+
+## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist)
+
+This property needs SUT-side instrumentation; a workload-only checker can only see "RSS grew
+unbounded after a fault," which is indistinguishable from the off-by-default limiter case.
+
+- `Unreachable` on the panic path: wrap the RSS read so the `.expect()` site is replaced with a
+  branch that, if it would have panicked, fires `assert_unreachable("limiter RSS read failed —
+  protection lost")`. The panic-the-checker-thread state is a critical-failure state that should
+  never be observed. (Today it IS observed — that is the finding; the assertion makes it a
+  reportable property rather than a silent thread death.)
+- Alternatively, if the fix is to surface-and-continue: `Sometimes(rss_read_failed_and_surfaced)`
+  on a new error-reporting path (logged + telemetry incremented + protection conservatively kept
+  active, e.g. retain a safe backoff). `Sometimes` because the failure is a rare optional path we
+  want to prove is *handled* at least once, not an always-true invariant.
+- Anchor a `Sometimes(active_backoff_nanos updated within last N ms)` liveness check to detect a
+  frozen/dead checker — proves the limiter is still doing work, not stuck.
+
+Honest framing: today there is no surfaced-error path, so the realistic immediate assertion is
+the `Unreachable` on the panic site (which will fire), documenting the fail-open. The
+`Sometimes(surfaced)` form presupposes an SUT-side fix and should be tagged as fix-dependent.
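+
+A minimal sketch of the fix-dependent surface-and-continue variant, assuming a failure counter and a
+conservative fallback backoff. `CheckerState`, `checker_step`, `rss_read_failures` and the simple
+threshold comparison are all hypothetical; the real loop calls `calculate_backoff`, whose arguments
+are not shown in this note.
+
+```rust
+use std::sync::atomic::{AtomicU64, Ordering};
+
+/// Hypothetical shared state; only `active_backoff_nanos` exists in the real limiter.
+struct CheckerState {
+    active_backoff_nanos: AtomicU64,
+    rss_read_failures: AtomicU64, // assumed new counter that surfaces failed reads
+}
+
+/// One iteration of a hypothetical checker loop. Returns `true` if the RSS read succeeded.
+fn checker_step(
+    state: &CheckerState,
+    rss: Option<u64>,
+    limit_bytes: u64,
+    fallback_backoff_nanos: u64,
+) -> bool {
+    match rss {
+        Some(actual) => {
+            // Placeholder policy standing in for `calculate_backoff`.
+            let backoff = if actual > limit_bytes { fallback_backoff_nanos } else { 0 };
+            state.active_backoff_nanos.store(backoff, Ordering::Relaxed);
+            true
+        }
+        None => {
+            // Surface the failure instead of panicking the thread.
+            state.rss_read_failures.fetch_add(1, Ordering::Relaxed);
+            // Keep protection conservatively active: never drop below the fallback backoff.
+            state
+                .active_backoff_nanos
+                .fetch_max(fallback_backoff_nanos, Ordering::Relaxed);
+            false
+        }
+    }
+}
+
+fn main() {
+    let state = CheckerState {
+        active_backoff_nanos: AtomicU64::new(0),
+        rss_read_failures: AtomicU64::new(0),
+    };
+    checker_step(&state, Some(10), 100, 5_000);
+    checker_step(&state, None, 100, 5_000);
+    println!(
+        "failures={} backoff={}ns",
+        state.rss_read_failures.load(Ordering::Relaxed),
+        state.active_backoff_nanos.load(Ordering::Relaxed)
+    );
+}
+```
+
+The `None` arm is where a `Sometimes(rss_read_failed_and_surfaced)` assertion would sit if this
+remediation were chosen; the fail-loud alternative raised under Open questions would replace that
+arm with a controlled shutdown instead.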
+
+## Configuration dependencies
+
+- Requires the limiter to actually be running: `memory_mode != disabled`, `memory_limit` set,
+  `enable_global_limiter: true`. Under the default `disabled` mode the limiter is a noop with no
+  checker thread, so this failure mode does not even exist (a separate, larger problem — no
+  protection at all; see `rss-bounded-under-cardinality`).
+- Platform: `process_memory::Querier` backing source (e.g. `/proc` on Linux) determines what
+  "RSS read failure" means and how to inject it.
+
+## Open questions
+
+- Can `process_memory::Querier::resident_set_size()` actually return `None`/error *after*
+  succeeding once at startup, on the Antithesis Linux target? If the underlying read effectively
+  cannot fail post-startup on this platform, the panic is only reachable via injected `/proc`
+  corruption — which determines whether this is a realistic production risk or a fault-injection-
+  only curiosity. This is the pivotal question for priority.
+- Is the frozen backoff value more likely 0 (fail-open, no protection) or nonzero (fail-stuck,
+  over-throttle) in practice? Both are bugs but with opposite symptoms; determines which
+  assertion/observable to lead with.
+- Should the correct behavior be "keep last-known protection" or "fail loudly and shut down"?
+  Given data components are fail-stop and the container s6 supervisor restarts ADP on exit, a
+  loud crash might be *safer* than silent freeze. The intended remediation changes whether the
+  property is framed as Unreachable(panic) or Reachable(clean restart).
diff --git a/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md b/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md
new file mode 100644
index 00000000000..0e04f13fc36
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md
@@ -0,0 +1,114 @@
+---
+slug: no-silent-interconnect-drop
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: No silent inter-component drop on a correctly-wired edge
+
+## Origin
+SUT analysis §4 ("Backpressure is real and is the load-safety mechanism") and §5
+safety guarantee #1. Decomposed from the headline guarantee "ADP will not crash
+under load, losing customer data" — the *no silent loss* half. No Antithesis
+assertion exists (existing-assertions.md).
+
+## What the code does
+
+### Backpressure (the await-on-full path)
+`lib/saluki-core/src/topology/interconnect/dispatcher.rs`:
+- `DispatchTarget::send` (~86-123): when senders are present, it sends to all but
+  the last sender via `sender.send(item.clone()).await` (~99-104) and the last via
+  `last_sender.send(item).await` (~107-111). `mpsc::Sender::send().await` **blocks
+  when the channel is full** — this is the backpressure. It returns `Err` only if
+  the receiver has been dropped (channel closed), mapped to a `GenericError`.
+- All edges are bounded `tokio::mpsc` (SUT analysis §2/§4). A slow downstream stalls
+  the upstream send, which (in the DSD source) stalls the read loop:
+  `sources/dogstatsd/mod.rs:1186` `memory_limiter.wait_for_capacity().await` plus the
+  socket read in the same `'read` loop, so backpressure propagates to the socket.
+
+### Silent discard (the disconnected-output path)
+`dispatcher.rs:86-92`: when `self.senders.is_empty()`, `send` increments
+`events_discarded_total` by `item.item_count()` and returns `Ok(())` — the events
+are **dropped silently** (no error, no backpressure). This only happens for an
+output with **zero** connected senders (a disconnected/un-wired output).
+
+### Existing unit test (not an Antithesis assertion)
+`sources/dogstatsd/mod.rs:2040-2063` `packet_forwarder_waits_when_queue_is_full`:
+fills `FORWARDER_QUEUE_CAPACITY` then asserts a further `forward()` does NOT complete
+within 100ms ("forwarding should wait for queue capacity instead of dropping").
+Confirms the intended backpressure (await, not drop) behavior at the statsd-forward
+boundary specifically.
+
+## Failure scenario (Antithesis)
+Sustained DSD load + a deliberately slow downstream consumer (e.g. throttle the
+forwarder/intake so the encoder→forwarder edge and then all upstream edges fill).
+Expectation: events are queued/awaited (backpressure to the socket), NOT discarded.
+On a correctly-wired edge `events_discarded_total` must stay at 0; the only visible
+effect is rising latency / falling socket read rate.
+
+## Key observations
+- The discard path and the backpressure path are mutually exclusive and chosen purely
+  by `senders.is_empty()`. So the safety statement is precise: **a wired edge never
+  discards; only a zero-sender edge discards.**
+- Partial-delivery hazard (SUT §4): send to N senders is sequential and not atomic —
+  if a *later* sender errors (receiver dropped) after earlier clones already sent, the
+  earlier sends are not rolled back. This is a *connection-closed* (shutdown/teardown)
+  case, not a full-channel case, so it does not contradict the no-discard-under-load
+  property but should be excluded from the assertion window (see Open Questions).
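+
+The two mutually exclusive paths can be modeled with a std-only sketch. This is an illustration, not
+the real `DispatchTarget::send`: it uses a blocking `std::sync::mpsc::sync_channel` in place of the
+async `tokio::mpsc` channel, and the `Dispatcher` type and its field names are hypothetical.
+
+```rust
+use std::sync::mpsc::{sync_channel, SendError, SyncSender};
+
+/// Hypothetical model of one output: zero or more bounded downstream edges.
+struct Dispatcher {
+    senders: Vec<SyncSender<Vec<u32>>>,
+    events_discarded_total: u64,
+}
+
+impl Dispatcher {
+    fn send(&mut self, item: Vec<u32>) -> Result<(), SendError<Vec<u32>>> {
+        // Disconnected output: count the events and report success (the silent-discard path).
+        if self.senders.is_empty() {
+            self.events_discarded_total += item.len() as u64;
+            return Ok(());
+        }
+        let (last, rest) = self.senders.split_last().expect("checked non-empty");
+        // Wired output: each send blocks while the bounded channel is full (backpressure)
+        // and fails only if the receiver has been dropped.
+        for sender in rest {
+            sender.send(item.clone())?; // sequential, not atomic across senders
+        }
+        last.send(item)
+    }
+}
+
+fn main() {
+    let mut unwired = Dispatcher { senders: Vec::new(), events_discarded_total: 0 };
+    unwired.send(vec![1, 2, 3]).unwrap();
+    println!("unwired discarded: {}", unwired.events_discarded_total);
+
+    let (tx, rx) = sync_channel(2);
+    let mut wired = Dispatcher { senders: vec![tx], events_discarded_total: 0 };
+    wired.send(vec![7, 8]).unwrap();
+    println!("wired received: {:?}, discarded: {}", rx.recv().unwrap(), wired.events_discarded_total);
+}
+```
+
+In this model the discard counter can only move on the zero-sender branch, which is the shape the
+`Always(delta == 0)` check on wired edges relies on.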
+
+## Config deps
+- Channel/`interconnect_capacity` default still unread (SUT §9 open question) — sets how
+  much buffering exists before backpressure engages; affects timing, not correctness.
+- The discard path is reachable only by a topology wiring with an unconnected output.
+  In the production DSD blueprint all three DSD outputs are wired, so on the production
+  path the discard branch should be Unreachable under load.
+
+## Suggested assertion (MISSING — net-new)
+- **Always** on the wired edge: at every check, `events_discarded_total` for a connected
+  output does not increase under sustained load (i.e. delta == 0). Anchor on the
+  `events_discarded_total` counter scoped per output. Safety, every-check.
+- **Sometimes(backpressure engaged)**: at least once, the source read loop is observed
+  blocked on a full downstream channel (meaningful progress into the throttled state) —
+  proves the fault actually exercised backpressure rather than the load being too light.
+  Could be read from rising `send_latency_seconds` or a workload-side stall signal.
+
+## SUT-side instrumentation needs
+- `events_discarded_total` is already emitted; a workload-side checker can read it via
+  the telemetry endpoint. For a crisp `Always`, an SDK `assert_always` at the discard
+  site (`dispatcher.rs:90`) gated to "output has a name on the production DSD path" would
+  fire only on the must-never branch — but note the discard branch is *legitimately*
+  reachable for genuinely disconnected outputs, so a blanket `assert_unreachable` there
+  is wrong. Prefer reading the counter from the workload for wired edges.
+
+## Open questions
+- **Does any production DSD output ever legitimately have zero senders?** If a named output
+  (e.g. `dsd_debug_log_out` when debug logging disabled, or `dsd_stats_out`) is conditionally
+  unwired, the discard path is reachable by config and the `Always(delta==0)` must be scoped
+  to the always-wired outputs (`metrics`, `events`, `service_checks`, `dd_out` chain). If all
+  outputs are always wired, the assertion can cover every edge.
+- **Should partial-delivery on receiver-drop be excluded?** Yes during teardown; confirm the
+  assertion window ends at the shutdown signal so the non-atomic multi-sender path (a closed
+  channel, not a full one) does not produce false positives.
+
+## Investigation Log
+
+#### Default `interconnect_capacity` (bounded mpsc size on topology edges)
+- **Examined**: `lib/saluki-core/src/topology/mod.rs:37`; `blueprint.rs:56,76,87-88,94`;
+  `built.rs:73,431-433,651-661` (channel construction); searched `bin/agent-data-plane` and
+  `lib/saluki-app` for `with_interconnect_capacity` / overrides.
+- **Found**: `const DEFAULT_INTERCONNECT_CAPACITY: NonZeroUsize = NonZeroUsize::new(128).unwrap();`
+  (`mod.rs:37`). `TopologyBlueprint::new` seeds `interconnect_capacity` to this default
+  (`blueprint.rs:76`). Each non-source event/payload edge builds a `mpsc::channel(interconnect_capacity.get())`
+  (`built.rs:653,661`), so every interconnect channel is bounded at **128** entries by default.
+  `with_interconnect_capacity()` exists (`blueprint.rs:87`) but no call site sets it outside the
+  unit test at `built.rs:717` (capacity 10). No ADP/app config overrides it.
+- **Not found**: No runtime/config knob exposing `interconnect_capacity`; it is a hardcoded
+  compile-time default with only a programmatic setter that production code never calls.
+- **Conclusion**: RESOLVED. Default interconnect capacity = **128** events/payloads per edge.
+  Note this is a count of *event buffers* (`EventsBuffer`), not individual metrics, so the burst
+  absorbed before backpressure is 128 buffers per downstream component. Property framing unchanged;
+  this only sizes workload load to reach the full-channel state.
diff --git a/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md b/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md
new file mode 100644
index 00000000000..bf202ae031d
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md
@@ -0,0 +1,138 @@
+---
+slug: non-finite-values-handled-consistently
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: Medium
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Non-finite metric values are handled consistently, never crash, no ghost metric
+
+## Origin
+SUT analysis §7 #7 ("non-finite metric values silently dropped"), #10 ("NaN poisons a
+DDSketch … finiteness guarded per-source, fragile if a new producer is added"), #11
+("All-non-finite packet → ghost metric with a valid context but zero data points"). No
+Antithesis assertion exists (existing-assertions.md).
+
+## What the code does
+`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs`:
+- `FloatIter::next` (286-307): parses `:`-delimited values; `Ok(value) if value.is_finite()`
+  → yields the value; `Ok(_)` (non-finite NaN/±Inf) → `debug!("Dropping non-finite …")` and
+  **loops to the next value** (does not yield, does not error); a true parse failure → `Err`.
+- `metric_values_from_raw` (250-271): `num_points` is incremented via
+  `FloatIter::new(input).inspect(|_| num_points += 1)` (253-254) — `inspect` fires only for
+  *yielded* items, so **non-finite values do not increment `num_points`**. An all-non-finite
+  packet → `num_points == 0` and the metric-values constructor (e.g. `gauge_fallible(floats)`,
+  258) receives an empty iterator, producing an empty `MetricValues` ("ghost" shape: valid
+  name/type, zero points).
+- Verified by `metric.rs::non_finite_metric_values_are_dropped` (743-759): asserts
+  `packet.num_points == 0` for `NaN|g`, `inf|g`, `-inf|g`.
+
+`lib/saluki-components/src/sources/dogstatsd/mod.rs::handle_frame` (1458-1542):
+- **1478-1480**: `if metric_packet.num_points == 0 { return Ok(None); }` — the zero-point
+  packet is dropped **before** context resolution (`handle_metric_packet`, 1490). So at the
+  source level, an all-non-finite packet consumes **no** interner/context-cache resources and
+  produces no downstream event. Confirmed by `non_finite_metric_values_are_silently_dropped`
+  (mod.rs:2470-2485): "handle_frame then returns Ok(None) for zero-point packets."
+- A *partial* packet (`x:NaN:5|g`) → `num_points == 1`, the finite value flows normally and the
+  non-finite one is dropped — consistent with the Agent.
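+
+The parse-and-count behavior described above can be sketched with the standard library alone.
+`parse_finite_values` is a hypothetical simplification, not the real `FloatIter` or
+`metric_values_from_raw`; it only mirrors the three cases (finite, non-finite, parse failure) and the
+`inspect`-based counting.
+
+```rust
+use std::num::ParseFloatError;
+
+/// Hypothetical simplification: returns the finite values and the number of yielded points.
+fn parse_finite_values(raw: &str) -> Result<(Vec<f64>, usize), ParseFloatError> {
+    let mut num_points = 0usize;
+    let values = raw
+        .split(':')
+        .filter_map(|token| match token.parse::<f64>() {
+            Ok(value) if value.is_finite() => Some(Ok(value)), // finite: yield it
+            Ok(_) => None,                                     // NaN / ±Inf: skip silently
+            Err(err) => Some(Err(err)),                        // true parse failure: propagate
+        })
+        // `inspect` fires only for yielded items, so skipped values are never counted.
+        .inspect(|_| num_points += 1)
+        .collect::<Result<Vec<f64>, _>>()?;
+    Ok((values, num_points))
+}
+
+fn main() {
+    println!("{:?}", parse_finite_values("NaN:5"));
+    println!("{:?}", parse_finite_values("NaN:inf:-inf"));
+    println!("{:?}", parse_finite_values("1:abc").is_err());
+}
+```
+
+An all-non-finite input yields an empty vector with a count of zero, which is the shape the
+`num_points == 0` short-circuit in `handle_frame` has to catch.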
+
+## Failure scenario (Antithesis)
+Send packets that are entirely non-finite (`m:NaN|g`, `m:inf:nan:-inf|h`, multi-value all-NaN
+distributions) and mixed finite/non-finite, across all metric types (c/g/ms/h/d/s). Expectations:
+(1) no panic; (2) all-non-finite packets produce **no** downstream metric and consume no
+interner/context-cache slot (no ghost metric reaches aggregation); (3) finite values in mixed
+packets are preserved exactly; (4) no NaN ever reaches a DDSketch (the §7 #10 poisoning hazard).
+
+## Key observations
+- **The ghost-metric risk is gated at the source today** by the `num_points == 0` short-circuit
+  (mod.rs:1478). Frame honestly: the property asserts the gate *holds* — an all-non-finite packet
+  must not create a context/sketch — rather than claiming a ghost metric exists on the DSD path.
+- The §7 #10 NaN-poisons-DDSketch hazard is real but currently *prevented* by the per-source
+  finiteness filter in `FloatIter`. The fragility is structural: finiteness is enforced in the
+  **codec**, not at the **sketch boundary** (`agent/sketch.rs`). A new producer (e.g. OTLP, or a
+  replay path that bypasses the codec) could feed NaN to a sketch. The property should assert the
+  invariant *at the sketch boundary* so it's robust to new producers.
+- Set metrics (`s`) take a different path (metric.rs:259-265, `num_points = 1` unconditionally, value
+  is the raw string) — non-finite-ness doesn't apply; exclude sets from the value-finiteness check.
+
+## Config deps
+- `permissive` mode and value parsing don't change finiteness handling.
+- The sketch-boundary check matters only for histogram/distribution/timer types (which build sketches);
+  gauge/counter store raw f64 values.
+
+## Suggested assertion (MISSING — net-new)
+- **Always**(no NaN in a sketch): `assert_always(value.is_finite())` at the DDSketch insert boundary
+  (`agent/sketch.rs` insert, ~188-206) — generalizes the per-source guard to the sketch itself, robust
+  to new producers. Catches §7 #10 directly.
+- **AlwaysOrUnreachable**(no zero-point metric reaches aggregation): an all-non-finite packet must not
+  produce a downstream `Event::Metric` with empty values — anchor at handle_frame (mod.rs:1478) /
+  aggregation insert. If the gate ever lets a zero-point metric through, that's the ghost metric.
+- **Sometimes(non_finite dropped)**: at least once, `FloatIter` drops a non-finite value (proves the
+  adversarial all-NaN input actually exercised the drop path). Meaningful state, not `Sometimes(true)`.
+- **Sometimes(ghost-metric path reachable)** — *only if* a producer that bypasses the `num_points==0`
+  gate is found (e.g. replay/OTLP). On the pure DSD path this is expected **Unreachable**; do not assert
+  `Sometimes` for it on DSD-only without first confirming a bypass exists (see Open Questions).
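+
+A toy model of why the guard belongs at the insert boundary, assuming nothing about the real
+DDSketch internals. The `Stats` struct and both insert methods are hypothetical; they only track the
+basic statistics that a NaN can corrupt.
+
+```rust
+/// Hypothetical basic-stats accumulator; not the real sketch type.
+#[derive(Debug)]
+struct Stats {
+    count: u64,
+    sum: f64,
+    min: f64,
+    max: f64,
+    rejected: u64,
+}
+
+impl Stats {
+    fn new() -> Self {
+        Stats { count: 0, sum: 0.0, min: f64::INFINITY, max: f64::NEG_INFINITY, rejected: 0 }
+    }
+
+    /// No finiteness check: a single NaN makes `sum` NaN for every later read.
+    fn insert_unguarded(&mut self, value: f64) {
+        self.count += 1;
+        self.sum += value;
+        // `f64::min` / `f64::max` return the non-NaN operand, so min/max stay valid.
+        self.min = self.min.min(value);
+        self.max = self.max.max(value);
+    }
+
+    /// Producer-independent guard: non-finite values are counted and rejected at the boundary.
+    fn insert_guarded(&mut self, value: f64) -> bool {
+        if !value.is_finite() {
+            self.rejected += 1;
+            return false;
+        }
+        self.insert_unguarded(value);
+        true
+    }
+}
+
+fn main() {
+    let samples = [1.0, f64::NAN, 3.0];
+
+    let mut poisoned = Stats::new();
+    samples.iter().for_each(|v| poisoned.insert_unguarded(*v));
+    println!("unguarded: {poisoned:?}");
+
+    let mut guarded = Stats::new();
+    samples.iter().for_each(|v| { guarded.insert_guarded(*v); });
+    println!("guarded:   {guarded:?}");
+}
+```
+
+The unguarded run ends with a NaN `sum` while `count`, `min` and `max` still look plausible, which
+is why a workload-only checker struggles to attribute the corruption to a specific insert.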
+
+## SUT-side instrumentation needs
+- The sketch-boundary `Always(is_finite)` and the zero-point-gate check need SDK assertions inside the
+  SUT; a workload-only checker sees aggregated output and cannot attribute a NaN sum to a sketch insert.
+- A `non_finite_dropped` counter (or assertion at metric.rs:301) gives the `Sometimes` reachability anchor.
+  No such counter exists today — currently only a `debug!` log.
+
+## Open questions
+- **Does `gauge_fallible([])` / the empty-iterator constructors ever return an `Err`** (the `_fallible`
+  suffix) rather than an empty value? If they error on empty input, the path is even safer
+  (handle_frame returns `Err`, counted); confirm the empty-iter behavior. This determines whether
+  `num_points == 0` is the only gate.
+- **Is `avg`/`sum` on an empty/NaN sketch ever surfaced as a metric** (the §7 #10 "permanently NaN")? Even
+  with the source guard, confirm a sketch can't reach a NaN aggregate via merge of pre-timestamped/
+  passthrough points. Affects whether the sketch-boundary `Always` is sufficient or a merge-time check is
+  also needed.
+
+### Investigation Log
+
+#### Is there any path that builds a metric/sketch from values WITHOUT going through FloatIter?
+- **Examined:** all metric/sketch producers and their topology wiring — DSD codec FloatIter
+  (`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs:254,299`); OTLP source
+  (`lib/saluki-components/src/sources/otlp/metrics/translator.rs`, incl. `get_number_data_point_value`
+  :1366, `is_skippable` :1374, `map_number_metrics` :726, the histogram→`Dogsketch::try_from` path
+  :314-351 and the explicit-bounds `insert_interpolate_buckets` path :889); self-telemetry
+  (`lib/saluki-core/src/observability/metrics/mod.rs:299-310`); checks_ipc
+  (`lib/saluki-components/src/sources/checks_ipc/mod.rs:185-204`); the aggregate
+  histogram→distribution `insert_n` (`lib/saluki-components/src/transforms/aggregate/mod.rs:745`);
+  the Datadog metrics encoder Histogram→sketch conversion
+  (`lib/saluki-components/src/encoders/datadog/metrics/mod.rs:1049-1058`); and the ADP topology
+  (`bin/agent-data-plane/src/cli/run.rs:462-499, 664-686, 745-755`). Full detail captured in
+  ddsketch-no-nan-poison.md Investigation Log.
+- **Found:** YES — multiple producers build metrics/sketches without FloatIter, but they fall into
+  three categories:
+  - **OTLP — guarded by its OWN finiteness filter.** Number/gauge/counter values pass `is_skippable`
+    (NaN/Inf skipped, translator.rs:726/754); histogram sketches are built via `Dogsketch::try_from`
+    (no raw insert) or `insert_interpolate_buckets` (which reconstructs finite `bin_lower_bound`
+    values before `adjust_basic_stats`). OTLP does not poison sum/avg with NaN. So OTLP does NOT make
+    the ghost/poison path live.
+  - **Aggregate transform `insert_n` (mod.rs:745) — DSD-ONLY.** `dsd_agg` is wired only in the DSD
+    pipeline (run.rs:664-679); checks_ipc and OTLP join at `metrics_enrich`, downstream of `dsd_agg`.
+    So the aggregate sketch path receives only FloatIter-filtered (finite) values. Not a bypass.
+  - **checks_ipc Histogram → Datadog metrics encoder — A REAL BYPASS (live).** checks_ipc
+    (mod.rs:195) builds `Metric::histogram(context, (timestamp, value))` from an external Python
+    check's raw f64 with **no finiteness check**, and routes `checks_ipc_in.metrics → metrics_enrich
+    → dd_metrics_encode` (run.rs:469/499) — skipping both FloatIter and the aggregate transform. The
+    encoder converts the Histogram to a sketch via `ddsketch.insert_n(sample.value...)`
+    (encoders/datadog/metrics/mod.rs:1054), so a NaN check value poisons the emitted sketch's
+    sum/avg. (Note: this poisons sum/avg but does not create the *zero-point ghost* shape — the
+    ghost-metric/`num_points==0` concern is specific to the DSD `FloatIter`+`num_points` interaction
+    and remains gated on the DSD path at mod.rs:1478.)
+- **Not found:** No finiteness guard on the checks_ipc value path; no guard at the sketch boundary;
+  no third metric ingress that bypasses both FloatIter and a per-source filter.
+- **Conclusion:** RESOLVED. The NaN-poison path (#10) is **LIVE** via checks_ipc → Datadog metrics
+  encoder, independent of the DSD codec. The ghost-metric (#11, zero-point) shape is NOT reproduced
+  by this bypass (it is DSD-`FloatIter`-specific and gated at handle_frame mod.rs:1478). Because the
+  finiteness invariant is enforced per-producer (DSD FloatIter, OTLP is_skippable) and NOT at the
+  sketch boundary, the suggested sketch-boundary `assert_always(value.is_finite())` is the
+  robust, producer-independent assertion and is justified by a concrete live bypass. The
+  `Sometimes(ghost-metric)` assertion should remain Unreachable-style on the DSD path; the live NaN
+  exposure is a *poisoned sum/avg* at the encoder, not a zero-point ghost.
diff --git a/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md b/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md
new file mode 100644
index 00000000000..f4ef0027c66
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md
@@ -0,0 +1,120 @@
+# prefix-filter-ordering-matches-agent
+
+## Origin
+
+Bug-history-sensitive coverage gap. A past correctness fix **"moved DSD prefix/filter in front of
+enrich"** (SUT analysis §8, churn hotspots), and the DSD transform chain now applies four
+name-rewriting/filtering stages in a *specific order* that must match the Datadog Agent's
+listener-side vs. time-sampler split. The order determines which name each downstream stage sees, so
+an ordering regression silently changes filtering outcomes. The `dogstatsd_prefix_filter`
+(listener-side: prefixing + blocklist/filterlist) and `dogstatsd_post_aggregate_filter`
+(time-sampler-side: histogram-aggregate-series filtering) deliberately split responsibility for the
+**same four config keys** — a split with subtle correctness rules and zero end-to-end property.
+
+## Code paths
+
+- Pipeline order (`bin/agent-data-plane/src/cli/run.rs:674-679`):
+  `dsd_in.metrics → dsd_enrich` (chained: `dogstatsd_mapper`, `run.rs:640-641`)
+  `→ dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter → metrics_enrich`.
+  - Mapper rewrites the name **first**, so prefix/blocklist see the *mapped* name.
+  - `dsd_prefix_filter` then prefixes (`statsd_metric_namespace`) and drops blocklisted names
+    **before** aggregation.
+  - `dsd_post_agg_filter` runs **after** `dsd_agg`, filtering only the generated histogram-aggregate
+    *series* names the aggregator produced (e.g. `foo.max`, `foo.95percentile`).
+- `dogstatsd_prefix_filter/mod.rs`
+  - `process_metric` (`mod.rs:234-267`): if a prefix is configured, prefixes the name **unless** the
+    name already starts with a `metric_prefix_blocklist` entry (`has_excluded_prefix`, `mod.rs:269-275`);
+    then checks the effective blocklist matcher and drops on match
+    (`events.remove_if(...)`, `mod.rs:298-303`).
+  - Default `metric_prefix_blocklist` is a fixed list of integration prefixes (`datadog.agent`, `jvm`,
+    `kafka`, …) (`mod.rs:67-91`).
+- `dogstatsd_post_aggregate_filter/mod.rs`
+  - `HistogramSuffixes::contains_filter_entry` (`mod.rs:178-186`): a filterlist entry "owns"
+    post-aggregate filtering **only** if it is shaped `<metric>.<aggregate-suffix>` (suffixes derived
+    from `histogram_aggregates` + `histogram_percentiles`, `mod.rs:150-172`). Other entries stay the
+    listener filter's responsibility — the explicit split.
+  - `should_filter_metric` (`mod.rs:238-240`): filters **only scalar series**
+    (`Counter|Rate|Gauge|Set`) — **sketches are kept** (test `sketch_metrics_are_not_filtered`,
+    `mod.rs:528`), matching the Agent time-sampler.
+- Both filters share the same four config keys (`METRIC_FILTERLIST_CONFIG_KEY`,
+  `METRIC_FILTERLIST_MATCH_PREFIX_CONFIG_KEY`, `STATSD_METRIC_BLOCKLIST_CONFIG_KEY`,
+  `STATSD_METRIC_BLOCKLIST_MATCH_PREFIX_CONFIG_KEY`, `dogstatsd_filterlist.rs`) and both reload live
+  (see `filter-config-reload-correct`).
+
+## Failure scenario
+
+- **Ordering regression:** if a refactor moves `dsd_prefix_filter` upstream of the mapper (chained
+  into `dsd_enrich`) or downstream of `dsd_agg`, the prefix/blocklist would see a different name
+  (pre-map, or post-aggregate-expanded) than the Agent's listener filter does → a metric is
+  blocklisted on one side but not the other, or is double-prefixed. The diff suite's happy path may
+  not catch a name that diverges only through this specific stage order.
+- **Split divergence:** an entry like `foo.max` (looks like a histogram aggregate) must be filtered
+  **post-aggregate** (it targets a generated series), while `foo` must be filtered **listener-side**.
+  If `contains_filter_entry`'s suffix detection (`mod.rs:178-186`) disagrees with the Agent's, an
+  entry is filtered at the wrong stage (or both, or neither) → a metric the operator blocklisted is
+  still forwarded, or a raw metric is dropped that should survive to aggregation.
+- **Prefix double-apply / blocklist bypass:** `has_excluded_prefix` logic interacting with a mapped
+  name could prefix a name that already carries an integration prefix, or fail to block a name that
+  only matches after prefixing — silently wrong egress identity.
+
+## Property
+
+- **Type:** Safety (ordering + differential).
+- **Invariant:**
+  - `Always(end-to-end keep/drop + final name within ratio of the Datadog Agent)` for the same
+    `statsd_metric_namespace`, `metric_filterlist`, `statsd_metric_blocklist`, and match-prefix flags
+    — the strongest check, anchored on Add-on 2's differential harness with a corpus that exercises
+    prefixing, blocklisting, and histogram-aggregate-series names.
+  - `AlwaysOrUnreachable(a non-histogram-shaped filterlist entry is NOT applied at the post-aggregate
+    stage)` and conversely `AlwaysOrUnreachable(a histogram-aggregate-series name is NOT dropped at
+    the listener stage by that entry)` — SUT-side, pins the prefix/post-agg responsibility split.
+  - `AlwaysOrUnreachable(post_agg_filter never drops a sketch metric)` (`mod.rs:238-240`).
+  - `Sometimes(prefix added)`, `Sometimes(listener blocklist dropped a metric)`,
+    `Sometimes(post-aggregate filter dropped a generated series)` for non-vacuity.
+  - Optionally a topology-shape assertion (`Always(dsd_prefix_filter is wired between dsd_enrich and
+    dsd_tag_filterlist; dsd_post_agg_filter after dsd_agg)`) read from the built blueprint, to catch
+    an ordering regression structurally.
+- **Antithesis angle:** corpus crafted so the *same* metric name's keep/drop decision depends on the
+  stage order (a name that is blocklisted only pre-prefix, or an entry that is ambiguous between
+  listener and post-aggregate ownership), plus fault-induced flush-timing skew on the differential
+  run. Compose with `mapper-output-matches-agent` (mapper feeds the prefix filter) and
+  `filter-config-reload-correct` (these filters reload live).
+- **Priority:** Medium (High if run as the primary regression tripwire for the prefix/filter-ordering
+  bug class).
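+
+A toy illustration of the order sensitivity the invariants above are meant to pin. Everything here
+is hypothetical (the mapper rule, the `acme` namespace, the `namespace.name` prefix shape, the
+exact-match blocklist); it is a sketch of the failure class, not the saluki filter logic.
+
+```rust
+/// Hypothetical mapper rule: rewrite one legacy name, pass everything else through.
+fn map_name(name: &str) -> String {
+    if name == "legacy.requests" {
+        "web.requests".to_string()
+    } else {
+        name.to_string()
+    }
+}
+
+/// Hypothetical listener-side stage: prefix the name, then drop it on an exact blocklist match.
+fn prefix_and_filter(name: &str, namespace: &str, blocklist: &[&str]) -> Option<String> {
+    let prefixed = format!("{namespace}.{name}");
+    if blocklist.contains(&prefixed.as_str()) {
+        None
+    } else {
+        Some(prefixed)
+    }
+}
+
+fn main() {
+    let blocklist = ["acme.web.requests"];
+
+    // Mapper first: the blocklist sees the mapped, prefixed name and drops the metric.
+    let mapped_first = prefix_and_filter(&map_name("legacy.requests"), "acme", &blocklist);
+    assert_eq!(mapped_first, None);
+
+    // Filter first (the regression): the blocklist sees the raw name, so the metric survives
+    // and the mapper rule no longer matches the already-prefixed name.
+    let filter_first =
+        prefix_and_filter("legacy.requests", "acme", &blocklist).map(|n| map_name(&n));
+    assert_eq!(filter_first, Some("acme.legacy.requests".to_string()));
+}
+```
+
+The same input is dropped under one order and forwarded under the other, which is the kind of name
+the differential corpus needs to contain for the `Always` check to be non-vacuous.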
+
+## Config dependencies
+
+- `statsd_metric_namespace`, `statsd_metric_namespace_blocklist`, `metric_filterlist`,
+  `metric_filterlist_match_prefix`, `statsd_metric_blocklist`,
+  `statsd_metric_blocklist_match_prefix`, `histogram_aggregates`, `histogram_percentiles` — set
+  identically on both sides for the differential facet.
+- The differential facet needs Add-on 2 (Agent baseline + ADP, identical workload). The split/sketch/
+  ordering invariants run SUT-side on the primary topology.
+
+## Open Questions
+
+- Does the Agent split listener-side vs. time-sampler filtering on exactly the
+  `<metric>.<aggregate-suffix>` shape that `contains_filter_entry` (`mod.rs:178-186`) uses? An
+  off-by-one in suffix detection silently routes an entry to the wrong stage.
+  `(needs Agent-source confirmation)`
+- Is the position of `dsd_prefix_filter` (after the mapper chained into `dsd_enrich`, before
+  `metrics_enrich`) load-bearing for Agent equivalence (the historical fix moved it), and is there
+  a regression guard today other than this proposed property? The fix suggests ordering is fragile.
+- `has_excluded_prefix` only consults `metric_prefix_blocklist` when a prefix is configured
+  (`mod.rs:269-275`); does the Agent skip prefixing for the same default integration-prefix set
+  (`mod.rs:67-91`), and does mapping a name change whether it carries such a prefix?
+- Both filters read the same four keys via separate watchers — a reload that updates one but lags the
+  other (compose with `filter-config-reload-correct`) could transiently filter at one stage but not
+  the other for the same logical rule. Confirm reachability.
+
+### Investigation Log
+
+- Examined: `run.rs:638-679` (chain wiring + order), full `dogstatsd_prefix_filter/mod.rs`
+  (process_metric, has_excluded_prefix, default blocklist, reload arms), full
+  `dogstatsd_post_aggregate_filter/mod.rs` (HistogramSuffixes, scalar-series gate, sketch exclusion,
+  reload arms).
+- Found: a deliberate listener-vs-time-sampler split over four shared keys, an ordering the codebase
+  history shows is correctness-fragile, and only self-consistent unit tests — no end-to-end ordering
+  or differential property. Distinct from `mapper-output-matches-agent` (name rewrite) and
+  `tag-filterlist-applied-consistently` (per-metric tag stripping); this owns the **prefix/blocklist +
+  post-aggregate split and the stage ordering**.
diff --git a/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md b/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md
new file mode 100644
index 00000000000..c2223268551
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md
@@ -0,0 +1,120 @@
+---
+slug: replay-corruption-not-silent-eof
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: Medium
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Capture corruption is distinguishable from a clean EOF (no silent truncation)
+
+## Origin
+SUT analysis §7 #12 ("Replay reader treats corruption as clean EOF … silently truncates
+the remaining record stream — false replay-fidelity confidence"). No Antithesis assertion
+exists (existing-assertions.md). **Framed honestly: the current code intentionally returns
+`Ok(None)` on these inputs; this property captures the data-fidelity risk, it does not
+claim the code is wrong today.**
+
+## What the code does
+`lib/saluki-components/src/sources/dogstatsd/replay/reader.rs::read_next` (84-104):
+```rust
+if self.offset + LENGTH_PREFIX_SIZE > self.contents.len() { return Ok(None); } // (a)
+let size = u32::from_le_bytes(...) as usize;
+self.offset += LENGTH_PREFIX_SIZE;
+// "The writer emits a zero-length prefix to mark the start of the tagger state trailer;
+//  treat that (and any size that would overrun the buffer) as the end of the record stream."
+if size == 0 || self.offset + size > self.contents.len() { return Ok(None); }       // (b)
+```
+Three distinct conditions all collapse to the **same** `Ok(None)` "clean EOF" signal:
+1. Legitimate end: offset reached the start of the zero-length trailer separator (`size == 0`).
+2. Truncation: a record length prefix is present but the body is cut short
+   (`offset + size > len`).
+3. Corruption: a corrupt length prefix that happens to read as `0` or as an oversized value.
+
+The driver (`bin/agent-data-plane/src/cli/dogstatsd.rs:367-373`) treats `Ok(None)` as
+"replay iteration completed" and stops sending packets — so cases 2 and 3 **silently drop
+every remaining record** with no error and no telemetry.
+
+Tests currently *assert* this behavior: `truncated_record_returns_none` (245-257) writes a
+file with the last 8 bytes dropped and asserts `read_next()` yields `Ok(None)` ("clean EOF on
+truncation"); `read_next_stops_at_state_separator` (233-242) asserts the trailer boundary →
+`Ok(None)`. So the silent-truncation behavior is encoded as intended.
+
+Contrast: `read_state` (126-131) **does** return an `Err` when the trailer size overruns the
+buffer — so the codebase already distinguishes "oversized length → error" in the trailer path
+but not in the record path. This asymmetry is the crux of the property.
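+
+A minimal sketch of what closing that asymmetry could look like, assuming a 4-byte little-endian
+length prefix as above. This `read_next` is a hypothetical free function over a byte slice, not the
+saluki reader, and treating "offset exactly at end of buffer" as a clean end is an assumption.
+
+```rust
+const LENGTH_PREFIX_SIZE: usize = 4;
+
+/// Ok(Some(body)) for a record, Ok(None) for a clean end, Err for a short or overrunning prefix.
+fn read_next<'a>(contents: &'a [u8], offset: &mut usize) -> Result<Option<&'a [u8]>, String> {
+    let remaining = contents.len().saturating_sub(*offset);
+    if remaining == 0 {
+        return Ok(None); // assumption: ending exactly on a record boundary is clean
+    }
+    if remaining < LENGTH_PREFIX_SIZE {
+        return Err(format!("truncated length prefix at offset {}", *offset));
+    }
+    let mut prefix = [0u8; LENGTH_PREFIX_SIZE];
+    prefix.copy_from_slice(&contents[*offset..*offset + LENGTH_PREFIX_SIZE]);
+    let size = u32::from_le_bytes(prefix) as usize;
+    *offset += LENGTH_PREFIX_SIZE;
+    if size == 0 {
+        // Trailer separator. A zeroed corrupt prefix looks identical here; telling the two
+        // apart would need a record count or similar, which the format does not carry.
+        return Ok(None);
+    }
+    if size > contents.len() - *offset {
+        return Err(format!("length prefix {size} overruns buffer at offset {}", *offset));
+    }
+    let body = &contents[*offset..*offset + size];
+    *offset += size;
+    Ok(Some(body))
+}
+
+fn main() {
+    // One valid 2-byte record followed by a corrupt, oversized prefix.
+    let corrupt = [2, 0, 0, 0, b'a', b'b', 0xFF, 0xFF, 0xFF, 0x7F];
+    let mut offset = 0;
+    assert_eq!(read_next(&corrupt, &mut offset), Ok(Some(&b"ab"[..])));
+    assert!(read_next(&corrupt, &mut offset).is_err());
+
+    // One valid record followed by the zero-length separator.
+    let clean = [2, 0, 0, 0, b'a', b'b', 0, 0, 0, 0];
+    let mut offset = 0;
+    assert_eq!(read_next(&clean, &mut offset), Ok(Some(&b"ab"[..])));
+    assert_eq!(read_next(&clean, &mut offset), Ok(None));
+}
+```
+
+Only the overrun and short-prefix cases become errors in this sketch; the zeroed-prefix case stays
+ambiguous, which is why the property cannot be a strict `Always` without a format change.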
+
+## Failure scenario (Antithesis)
+Replay a capture that is valid for the first N records, then has a corrupt 4-byte length
+prefix (e.g. a flipped byte making `size` huge, or a zeroed prefix mid-stream). The reader
+returns `Ok(None)` at that point; the replay tool reports success having sent only N of M
+records. A diff against the capture's true record count would reveal the loss, but the tool
+itself signals success — the fidelity loss is invisible.
+
+## Key observations
+- This is a *data-fidelity* property, not a crash property. The "bad thing" is **silent**
+  truncation reported as success, not a panic.
+- A faithful fix would track whether the offset reached exactly the trailer separator vs. ran
+  off a malformed prefix, and surface the latter as `Err`. The trailer path (read_state) already
+  does this for oversize. The property can be stated without demanding a fix: *if records were
+  truncated by corruption, the replay must not report a clean completion.*
+- Because the legitimate-EOF separator (`size == 0`) and a zeroed corrupt prefix are byte-identical
+  from `read_next`'s local view (an overrunning prefix, by contrast, is detectable locally),
+  telling them apart requires either a record count in the header/trailer or an explicit
+  corruption sentinel. Neither exists today (open question).
+
+## Config deps
+- Same gating as `replay-no-panic-on-malformed-capture`: `dogstatsd replay` subcommand, UDS
+  listener, Linux-only.
+- File version ≥ MIN_STATE_VERSION (2) means a trailer is expected (file_header.rs:11); the
+  separator-vs-truncation ambiguity is most acute for versioned files that *should* have a trailer.
+
+## Suggested assertion (MISSING — net-new)
+- **AlwaysOrUnreachable**(replay completion is faithful): when `read_next` returns the
+  terminating `Ok(None)`, the consumed offset equals the start of the tagger-state trailer
+  (clean end) — i.e. the loop did not stop because of an unconsumed/over-running length prefix.
+  Realize as an SDK assertion at the (b) branch (reader.rs:95) distinguishing
+  `size == 0 && at_trailer_boundary` (clean) from the overrun/`size==0`-mid-stream case (corrupt).
+- **Sometimes(corruption-detected)**: at least once, the reader reaches the (b) branch with a
+  length prefix that overruns the buffer (proves the corrupt input actually exercised the path,
+  not just clean EOF). Meaningful state, not `Sometimes(true)`.
+
+## SUT-side instrumentation needs
+- A workload-only check cannot tell truncation from clean EOF (both look like "replay finished").
+  Needs an SDK assertion at reader.rs:95 (or a new telemetry counter `replay_records_truncated`)
+  to expose the corrupt branch. Could also compare a record count emitted at capture time against
+  records replayed.
+
+## Open questions
+- **Is there a record count or total-length field anywhere** (header/trailer) that would let the
+  reader detect "stopped early"? If not, distinguishing truncation from clean EOF requires a format
+  change. Determines whether this can be a strict `Always` or only a best-effort heuristic.
+- **Do the maintainers consider silent truncation acceptable for replay** (best-effort tool) vs. a
+  fidelity bug? The intentional `Ok(None)` and the asserting tests suggest "accepted"; this property
+  documents the risk and lets Antithesis quantify how often corruption silently truncates. Changes
+  priority, not the property statement.
+- **How does a corrupt prefix that decodes to a small but wrong `size` behave?** When
+  `offset + size <= len` the bounds check passes and garbage bytes are decoded as a record body →
+  either a prost decode `Err` (good, surfaced) or, worst case, a successfully decoded *wrong*
+  record. Worth enumerating: this is a third outcome (silent corruption rather than silent
+  truncation).
+
+### Investigation Log
+
+#### Do the maintainers consider silent truncation acceptable, or is it a fidelity bug? `(needs human input)`
+
+- **Examined**: `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs` `read_next`
+  (length-prefix parse ~84-104, the `size == 0 || offset+size > len → Ok(None)` collapse); the
+  reader's own unit tests (~244-257) which feed truncated/oversized prefixes and **assert** the
+  result is `Ok(None)`; the capture-file format (`UnixDogstatsdMsg` records + a `TaggerState`
+  trailer, `writer.rs`) for any record-count or total-length field.
+- **Found**: the silent-truncation behavior is *intentional in code* — `Ok(None)` is the deliberate
+  return for both legitimate EOF and a corrupt/over-running prefix, and the tests pin that behavior
+  as desired. There is **no record-count or total-length field** in the format, so the reader has no
+  in-band way to distinguish "stopped early" from "clean end."
+- **Not found**: any comment, doc, ADR, or commit message stating whether silent truncation is an
+  accepted best-effort property of the replay tool or a known fidelity gap. Code intent ("we return
+  Ok(None)") is clear; *product* intent ("is that OK?") is not recoverable from the repo.
+- **Conclusion**: tagged `(needs human input)`. The behavior is unambiguous; only the maintainers can
+  say whether it is acceptable. The answer changes this property's **priority** (and whether a
+  format change adding a record count is warranted), not the property statement.
diff --git a/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md b/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md
new file mode 100644
index 00000000000..e9154049524
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md
@@ -0,0 +1,130 @@
+---
+slug: replay-no-panic-on-malformed-capture
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Parsing an arbitrary/corrupt/truncated DogStatsD capture never panics
+
+## Origin
+SUT analysis §6 gap #6 ("DogStatsD replay has zero suite coverage despite being the
+newest, largest, untrusted-input feature"), §7 #12, §8 ("Most regression-prone area:
+DogStatsD replay", `e88d04b10a`, +1765 LOC, validated only with `cargo check`). No
+Antithesis assertion exists (existing-assertions.md).
+
+## What the code does
+`lib/saluki-components/src/sources/dogstatsd/replay/reader.rs`:
+- `from_path` (39-64): `fs::read` → zstd sniff (`has_zstd_magic`, 141-143, checks 4
+  magic bytes) → `zstd::stream::decode_all` (44) → `valid_header` → `file_version`.
+  All fallible steps map to `GenericError` via `?`. zstd errors are caught (45).
+- `read_next` (84-104): bounds-checks `offset + LENGTH_PREFIX_SIZE > len` → `Ok(None)`
+  (85-87); reads 4-byte LE length prefix; **`size_bytes.try_into().expect("length
+  prefix is 4 bytes")`** (90) — this `expect` is provably safe because the slice is
+  exactly `[offset .. offset+4]` after the bounds check, so it can never fire on any
+  input; bounds-checks `offset + size > len` → `Ok(None)` (95); `UnixDogstatsdMsg::decode`
+  returns mapped error (99-100).
+- `read_state` (110-138): symmetric, with the same provably-safe `expect` at 121 and a
+  real error return at 126-131 if the trailer size overruns.
+
+`bin/agent-data-plane/src/cli/dogstatsd.rs` (the **driver**; runs in the replay CLI process, not in
+the running data-plane, per the Investigation Log):
+- 269-270: `from_path(&cmd.replay_file_path)?` then `read_state()?` (config-check-style path).
+- 355: `from_path(replay_file_path)?`; 367: `let msg = match reader.read_next()? { … }`
+  inside `replay_one_iteration` — all errors propagate via `?` up to a `tokio::select!`
+  (341-347) that returns the error. No `unwrap`/`expect` on the reader results here.
+
+## Failure scenario (Antithesis)
+Feed the `agent-data-plane dogstatsd replay` subcommand a capture file that is:
+arbitrary bytes; a valid header followed by a truncated/garbage protobuf; a zstd stream
+that decompresses to a header + corrupt body; a length prefix that decodes but whose body
+is invalid protobuf; a zstd bomb / partial zstd frame. Expectation: the process exits with
+a clean `Err` (non-zero exit, logged), never a panic/abort/SIGABRT.
+
+## Key observations
+- The two non-test `expect` sites (reader.rs:90, 121) are guarded by an exact-length
+  bounds check, so they are **not** reachable panic sites — the "~25 unwrap/expect" figure
+  in the brief is the whole-file count and is dominated by test code (26 of 28 are in
+  `#[cfg(test)]`). The real untrusted-input panic surface in the reader is small.
+- The realistic panic risk is in the *dependencies*: `zstd::stream::decode_all` on a
+  malicious stream (memory blowup / library panic) and `prost`'s `Message::decode` on
+  adversarial protobuf (recursion/length). Both are wrapped in `Result`, but a panic
+  inside them would still abort — Antithesis is the right tool to find such a panic.
+- A panic is invisible to a workload-only checker if the replay runs as a subprocess that
+  is expected to exit non-zero anyway; distinguishing "clean error exit" from "panic/abort"
+  needs SUT-side signal.
+
+## Config deps
+- Replay path is gated on the `dogstatsd replay` CLI subcommand + a UDS listener; Linux-only
+  (`#[cfg(target_os = "linux")]`, dogstatsd.rs:351). The capture file path is operator-supplied.
+- zstd decompression is auto-selected by magic bytes — no config flag needed to reach it.
+
+## Suggested assertion (MISSING — net-new)
+- **Unreachable** at any panic/abort originating from the reader or its decode calls. Best
+  realized as an SDK `assert_unreachable` in a panic hook installed for the replay path, or
+  by treating any SIGABRT/panic-unwind during replay as a property violation. The workload
+  cannot cleanly observe a panic from outside, so this needs SUT-side instrumentation.
+- Pair with **Reachable**(replay parse executed) so the test confirms the path was actually
+  exercised, not skipped because the subcommand never ran.
+
+## SUT-side instrumentation needs
+- A panic hook (or `assert_unreachable`) on the replay code path is required; a workload-only
+  check sees only an exit code and cannot distinguish panic from a deliberate `Err`.
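+- Optionally wrap the `read_next` call at dogstatsd.rs:367 in `std::panic::catch_unwind` and anchor
+  `assert_always(result.is_ok() || clean_err)` on the outcome, so an unexpected panic in
+  `read_next` is converted into a recorded assertion failure. A bare assertion placed after the
+  call would never run if the call panics.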
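+
+A minimal sketch of separating "clean `Err`" from "panic" inside the process, using only the
+standard library. `Outcome` and `classify` are hypothetical names, and the `Panicked` arm merely
+marks where an SDK `assert_unreachable` would be recorded; no Antithesis SDK call is shown.
+
+```rust
+use std::panic;
+
+#[derive(Debug, PartialEq)]
+enum Outcome {
+    Ok,
+    CleanErr,
+    Panicked,
+}
+
+/// Runs a parse step and reports whether it succeeded, failed cleanly, or unwound.
+fn classify<F>(parse: F) -> Outcome
+where
+    F: FnOnce() -> Result<(), String> + panic::UnwindSafe,
+{
+    match panic::catch_unwind(parse) {
+        Ok(Ok(())) => Outcome::Ok,
+        Ok(Err(_)) => Outcome::CleanErr,
+        // Hypothetical hook point: record the property violation here.
+        Err(_) => Outcome::Panicked,
+    }
+}
+
+fn main() {
+    // Silence the default panic message so the demo output stays quiet.
+    panic::set_hook(Box::new(|_| {}));
+
+    assert_eq!(classify(|| Ok(())), Outcome::Ok);
+    assert_eq!(classify(|| Err("bad header".to_string())), Outcome::CleanErr);
+    assert_eq!(classify(|| panic!("decoder bug")), Outcome::Panicked);
+}
+```
+
+`catch_unwind` only observes unwinding panics; a build with `panic = "abort"` or an abort inside a
+C dependency would still need the process-level SIGABRT signal.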
+
+## Open questions
+- (none remaining — see Investigation Log)
+
+### Investigation Log
+
+#### How is replay invoked; whole-file read OOM vector; zstd decompression-bomb vector
+- **Examined:** `bin/agent-data-plane/src/cli/dogstatsd.rs:38-114` (subcommand defs),
+  `:169-213` (`handle_dogstatsd_command` dispatch), `:261-310` (`handle_dogstatsd_replay`),
+  `:322-399` (`run_dogstatsd_replay` / `replay_one_iteration`); and
+  `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs:34-143`.
+- **Found:**
+  - **(a) Separate process, sends over UDS to the running data-plane.** Replay is the
+    `dogstatsd replay` argh subcommand (`ReplayCommand`, dogstatsd.rs:103-114), dispatched at
+    `dogstatsd.rs:192-211` to `handle_dogstatsd_replay`. The CLI process itself opens the capture
+    file and reads records, then **sends each record as a UDS datagram to the already-running ADP**
+    via `uds_sendmsg_with_creds(socket, &msg.payload, &credentials)` (`dogstatsd.rs:394`); the
+    socket target is ADP's configured `dogstatsd_socket` (`dogstatsd_socket_path`, :313). So parsing
+    of the capture file happens **in the replay CLI process, not in the data-plane process.** A
+    panic during parsing aborts the replay tool (exit/SIGABRT), not the data-plane.
+    - Consequence for instrumentation: a panic-catch / `assert_unreachable` for malformed-capture
+      parsing belongs in the **replay CLI process** (the `from_path`/`read_next`/`read_state` call
+      sites in `dogstatsd.rs` and `reader.rs`). The data-plane process only ever sees the resulting
+      *bytes* of `msg.payload` arriving over the DSD UDS socket — i.e. ordinary DSD packets, which
+      are covered by the malformed-dsd-no-crash property, not by the capture parser.
+    - Note `from_path` is invoked **at least twice per replay**: once eagerly in
+      `handle_dogstatsd_replay` (dogstatsd.rs:269, for `read_state`) and again on every loop
+      iteration in `replay_one_iteration` (dogstatsd.rs:355). Both propagate parse errors via `?`;
+      no `unwrap`/`expect` on reader output.
+  - **(b) Whole-file `fs::read` with NO size guard — OOM vector confirmed.** `reader.rs:40-41`:
+    `let raw = fs::read(path).map_err(...)?;` reads the *entire* file into a `Vec<u8>` before any
+    parsing or size check. There is no stat/metadata length check, no max-size constant, no
+    streaming. A multi-GB capture path is loaded fully into memory in the replay process. This is an
+    OOM vector independent of parsing correctness. (Lives in the replay CLI process per (a).)
+  - **(c) `zstd::stream::decode_all` on untrusted input with NO decompressed-size cap —
+    decompression-bomb vector confirmed.** `reader.rs:43-48`: if `has_zstd_magic(&raw)` (4-byte magic
+    sniff, `reader.rs:141-143`), it calls `zstd::stream::decode_all(raw.as_slice())` (`reader.rs:44`)
+    and stores the full decompressed output in `contents`. There is no `Decoder` with a window/size
+    limit, no streaming bound, no cap on decompressed length — `decode_all` allocates as large as the
+    stream dictates. A small crafted `.dog.zstd` can expand to an arbitrarily large `Vec<u8>`. Errors
+    from `decode_all` are caught (`.map_err(...)?`, reader.rs:45), so a *malformed* stream returns a
+    clean `Err`; the hazard is specifically unbounded *memory*, not a panic, on a *valid but huge*
+    decompression.
+- **Not found:** No file-size limit, no `fs::metadata` length pre-check, no zstd window/size cap,
+  no streaming reader. No panic-prone `unwrap`/`expect` on untrusted bytes in the reader (the two
+  non-test `expect`s at reader.rs:90/121 are guarded by exact-length bounds checks, as previously
+  noted).
+- **Conclusion:** RESOLVED. (a) Replay is a separate CLI process that parses the capture and forwards
+  payloads to the running ADP over the DSD UDS socket — panic/assert instrumentation for capture
+  parsing belongs in the replay process. (b) and (c) are both LIVE resource-exhaustion vectors:
+  unbounded `fs::read` (reader.rs:40) and uncapped `zstd::stream::decode_all` (reader.rs:44). These
+  are OOM/decompression-bomb hazards (memory-bound family), distinct from the no-panic property; they
+  warrant either a size cap in the reader or an explicit resource-exhaustion property. The no-panic
+  property itself stands and its panic surface is the zstd/prost decode calls, not the reader's own
+  bounds-checked logic.
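+
+A minimal sketch of the "size cap in the reader" option, using only `std::io::Read`. `read_capped`
+and the cap value are hypothetical; the idea is that the same wrapper could sit on a `File` (for
+(b)) or on any streaming decoder that implements `Read` (for (c)) instead of an unbounded
+read-everything call.
+
+```rust
+use std::io::{self, Read};
+
+/// Reads at most `cap` bytes from `reader`; returns an error if the source holds more.
+fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
+    let mut buf = Vec::new();
+    // Read one byte past the cap so "exactly cap" and "over cap" are distinguishable.
+    reader.take(cap.saturating_add(1)).read_to_end(&mut buf)?;
+    if buf.len() as u64 > cap {
+        return Err(io::Error::new(io::ErrorKind::InvalidData, "input exceeds size cap"));
+    }
+    Ok(buf)
+}
+
+fn main() {
+    let small: &[u8] = b"capture";
+    assert_eq!(read_capped(small, 16).unwrap(), b"capture".to_vec());
+
+    let big = vec![0u8; 1024];
+    assert!(read_capped(big.as_slice(), 16).is_err());
+}
+```
+
+Memory use is bounded by `cap + 1` bytes of payload regardless of how large the source claims to
+be, which turns the OOM hazard into an ordinary `Err` on the existing `?` path.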
diff --git a/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md b/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md
new file mode 100644
index 00000000000..1152b04834e
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md
@@ -0,0 +1,150 @@
+# retry-queue-bounded-under-outage
+
+**Family:** Resource Boundaries — queues / backpressure / exhaustion
+**Status:** Verified against code at commit 042f41db3b. The byte-cap + drop-oldest invariant is
+**expected to HOLD** (it is genuinely enforced). The tension is that staying bounded *implies
+silent data loss* on prolonged outage — both halves are properties worth asserting.
+
+## What led to the property
+
+The headline guarantee is "won't crash under load, losing customer data," but `sut-analysis.md`
+§2/§5 (Liveness 4) flags a real tension: the forwarder retry queue caps memory, which means a
+*prolonged* intake outage forces drops. This property pins down the safety half precisely: under
+a sustained outage the in-memory + disk retry queue stays within configured byte caps and
+overflow drops the **oldest** entries (bias to freshest data), always **counted**, never growing
+unbounded. The existing suite never tests intake-down at the system level (§6 gap 2).
+
+## Behavior in code
+
+Egress is `TransactionForwarder` (`lib/saluki-components/src/common/datadog/io.rs`); each resolved
+endpoint owns a `PendingTransactions` (two-tier: high-priority in-memory `VecDeque` for fresh
+data + low-priority `RetryQueue` for retries/overflow). Under outage the retry circuit breaker
+re-enqueues to the low-priority queue, so the `RetryQueue` is where unbounded growth would happen.
+
+**In-memory cap with drop-oldest** — `RetryQueue::push`
+(`lib/saluki-io/src/net/util/retry/queue/mod.rs:179-220`):
+```rust
+if current_entry_size > self.max_in_memory_bytes {            // single entry too big => Err
+    return Err(generic_error!("Entry too large to fit into retry queue. (...)"));
+}
+while !self.pending.is_empty()
+      && self.total_in_memory_bytes + current_entry_size > self.max_in_memory_bytes {
+    let oldest_entry = self.pending.pop_front()...;            // OLDEST first
+    if let Some(persisted) = &mut self.persisted_pending {
+        persisted.push(oldest_entry).await?;                  // spill to disk if enabled
+    } else {
+        push_result.track_dropped_item(&oldest_entry);        // else DROP + COUNT
+    }
+    self.total_in_memory_bytes -= oldest_entry_size;
+}
+self.pending.push_back(entry);
+self.total_in_memory_bytes += current_entry_size;
+```
+So `total_in_memory_bytes` is held at/under `max_in_memory_bytes` by evicting oldest-first; with
+no disk persistence the evicted entries are dropped and tracked in `PushResult`.
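+
+A self-contained toy model of the same cap-and-evict loop, with the byte-bound invariant checked
+after every push. `BoundedQueue` is a hypothetical stand-in (synchronous, no disk tier, entries are
+raw byte vectors), and the `assert!` only marks where the suggested `Always` would sit.
+
+```rust
+use std::collections::VecDeque;
+
+struct BoundedQueue {
+    pending: VecDeque<Vec<u8>>,
+    total_bytes: usize,
+    max_bytes: usize,
+}
+
+impl BoundedQueue {
+    fn new(max_bytes: usize) -> Self {
+        BoundedQueue { pending: VecDeque::new(), total_bytes: 0, max_bytes }
+    }
+
+    /// Returns how many old entries were dropped, or Err if the entry alone exceeds the cap.
+    fn push(&mut self, entry: Vec<u8>) -> Result<usize, String> {
+        let size = entry.len();
+        if size > self.max_bytes {
+            return Err("entry too large to fit into queue".to_string());
+        }
+        let mut dropped = 0;
+        while self.total_bytes + size > self.max_bytes {
+            // Evict oldest-first until the new entry fits.
+            let Some(oldest) = self.pending.pop_front() else { break };
+            self.total_bytes -= oldest.len();
+            dropped += 1;
+        }
+        self.pending.push_back(entry);
+        self.total_bytes += size;
+        // Stand-in for Always(total_in_memory_bytes <= max_in_memory_bytes).
+        assert!(self.total_bytes <= self.max_bytes);
+        Ok(dropped)
+    }
+}
+
+fn main() {
+    let mut q = BoundedQueue::new(10);
+    assert_eq!(q.push(vec![1; 4]), Ok(0));
+    assert_eq!(q.push(vec![2; 4]), Ok(0));
+    assert_eq!(q.push(vec![3; 4]), Ok(1)); // evicts the oldest 4-byte entry
+    assert_eq!(q.pending.front(), Some(&vec![2; 4]));
+    assert!(q.push(vec![0; 11]).is_err()); // rejected outright, nothing evicted
+    assert_eq!(q.total_bytes, 8);
+}
+```
+
+The returned drop count plays the role of `PushResult`: a `Sometimes(dropped > 0)` on it is what
+shows the cap was actually reached rather than trivially satisfied.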
+
+**Disk cap with drop-oldest** — `PersistedQueue` enforces a disk-byte limit, evicting oldest on
+overflow (`retry/queue/persisted.rs:285-330`): `while !entries.is_empty() && total_on_disk_bytes
++ required_bytes > limit { ... track_dropped_item(...); entries_dropped += 1; }`. The limit is
+`min(max_on_disk_bytes, total_space * storage_max_disk_ratio)` (`persisted.rs:343-349`,
+ratio default 0.8). Corrupt/unconsumable persisted entries are also permanently dropped and
+counted (`persisted.rs:234-238, 317-321`).
+
+**Drops are surfaced as telemetry** — `PushResult` (`mod.rs:49-78`) carries
+`items_dropped`/`events_dropped`/`data_points_dropped`; every push site routes it through
+`track_queue_drops` (`io.rs:535-539`, called at `io.rs:420, 471, 498`), and persisted-entry drops
+flow via `take_persisted_entries_dropped` => `low_prio_queue_entries_dropped` (`io.rs:739-744`).
+The `#[must_use]` on `PushResult` (`mod.rs:49`) makes dropped-item info hard to ignore. So drops
+are counted, not fully silent — but they are still **data loss** with only a counter to show it.
+
+## Failure scenario (Antithesis)
+
+Drive sustained load into DSD while Antithesis holds the mock intake down (connection refused /
+black-hole / 5xx storm / slow) long enough for the retry queue to saturate. Assert:
+1. **Safety:** `total_in_memory_bytes <= max_in_memory_bytes` and on-disk bytes `<= disk limit`
+   at all times — the queue never grows unbounded (`Always`).
+2. **Liveness/loss:** `Sometimes(items_dropped > 0)` once saturated — proves the bound is real and
+   the drop path is exercised (the data-loss reality of the guarantee).
+3. **Recovery (cross-ref, separate property):** after the outage clears, queued data drains
+   high-priority-first. Antithesis interleaving stresses the shared circuit-breaker backoff +
+   per-endpoint queues that the deterministic harness never exercises.
+
+## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist)
+
+- `Always(total_in_memory_bytes <= max_in_memory_bytes)` in `RetryQueue::push`/after-push; and
+  the analogous disk-bytes `Always` in `PersistedQueue::push`. Safety: must hold every check; the
+  eviction `while` loops make this a true invariant, so a real `Always`.
+- `Sometimes(push_result.items_dropped > 0)` — proves saturation/eviction is reached (the
+  bound is actually load-bearing, not vacuous). Liveness/progress.
+- `Sometimes(persisted_entries_dropped > 0)` when disk persistence is enabled — proves the disk
+  cap also evicts. Optional path => Sometimes, not Always.
+- Consider `Reachable` on the "entry too large to ever fit" `Err` branch (`mod.rs:184-189`) if
+  the workload can produce an oversized payload — a distinct failure mode (the whole entry is
+  rejected, not evicted).
+
+SUT-side beats workload-only: a workload checker at the mock intake sees *which* metrics never
+arrive but cannot distinguish "dropped by retry-queue overflow" from "dropped as 400/401/403/413
+permanent failure" (`classifier/http.rs`) from "still queued." The byte-bound invariant is only
+observable from inside `RetryQueue`.
+
+## Configuration dependencies
+
+- `forwarder_retry_queue_payloads_max_size` / `forwarder_retry_queue_max_size` => in-memory byte
+  cap; default **15 MiB** when both unset (`retry.rs:97-104, 160-166`,
+  `FORWARDER_RETRY_QUEUE_PAYLOADS_MAX_SIZE_BYTES`).
+- `forwarder_storage_max_size_in_bytes` (disk cap) default **0 => disk persistence DISABLED**
+  (`retry.rs:39-41, 110-113, 169-171`; gated at `io.rs:394`). So by default overflow goes
+  straight to **drop**, not disk. Disk path only active when operator sets a nonzero value (and
+  `forwarder_storage_path`).
+- `forwarder_storage_max_disk_ratio` (default 0.8) caps disk usage relative to total volume space.
+- Per-endpoint: each resolved endpoint has its own queue, so the *aggregate* memory bound is
+  `num_endpoints * max_in_memory_bytes` (+ disk). Multi-endpoint fan-out multiplies the cap.
+
+## Open questions
+
+- Is the aggregate bound across all endpoints (`num_endpoints * 15 MiB` in-memory) the right thing
+  to assert, or is the per-endpoint cap sufficient? With many additional endpoints the total can
+  be large; matters for whether "bounded" really protects process RSS.
+- Does `IncomingBytesPerSec` / queue-duration accounting (`io.rs:582-639`) feed any *additional*
+  drop policy (time-based eviction) beyond the byte cap, the way the Datadog Agent's
+  `queue_duration_capacity` does? If so, the bound is byte-AND-time and the assertion must cover both.
+
+### Investigation Log
+
+#### Disk-init fallback byte-cap + corrupt-file wedging under fault injection (2026-05-28)
+
+**Examined:** `lib/saluki-components/src/common/datadog/io.rs:391-410` (queue create + silent
+fallback); `lib/saluki-io/src/net/util/retry/queue/persisted.rs` `pop` (:206-243),
+`refresh_entry_state` (:245-273), `try_deserialize_entry` (:354-398), `push` (:164-199),
+`remove_until_available_space` (:304-330). (Full trace recorded in
+`disk-persisted-retry-survives-restart.md` Investigation Log, 2026-05-28.)
+
+**Found:**
+- **Byte cap holds in the degraded (fallback) mode.** Disk-init failure falls back to
+  `RetryQueue::new(queue_id, config.retry().queue_max_size_bytes())` (io.rs:407) — the same
+  in-memory cap as the non-persisted path (io.rs:391). Degraded mode is just the drop-oldest /
+  drop-not-spill branch of `RetryQueue::push`, so `total_in_memory_bytes <= max_in_memory_bytes`
+  still holds. The disk-path `Always` is vacuous in fallback mode (no disk queue exists), but the
+  in-memory `Always` is preserved.
+- **Durability downgrade is surfaced only as an `error!` log (io.rs:406), no metric** — so a
+  bounded-queue workload that intends to exercise the disk cap must detect the fallback (log-scrape
+  or `assert_unreachable` at io.rs:405) or it will silently be testing the in-memory cap instead.
+- **A corrupt / torn-written persisted file does NOT wedge the queue and does NOT break the disk
+  byte cap.** `pop` drops corrupt entries (warn + `entries_dropped++` + decrement
+  `total_on_disk_bytes`) and `continue`s past them (persisted.rs:227-240); the eviction path does
+  the same (:313-322); unrecognized-named files are skipped during the scan (:255-262). Dropping a
+  corrupt entry *decrements* `total_on_disk_bytes`, so it can only help the cap, never violate it.
+  Writes are non-atomic (`tokio::fs::write` direct to final path, :184, no temp+rename), so a
+  SIGKILL mid-write yields a valid-name/truncated-content file → classified corrupt → dropped on
+  read. Recovery proceeds past any number of such files. The "~47 unwrap/expect" concern: the
+  recovery/eviction hot paths use `Result`-propagating `?`/match on the deserialize and IO error
+  paths, not `unwrap`, so a corrupt file surfaces as a handled `Err`, not a panic.
+
+**Not found:** No path where a corrupt/torn file inflates `total_on_disk_bytes` past the limit or
+halts eviction/recovery; no metric for the fallback downgrade.
+
+**Conclusion (RESOLVED):** The byte-cap `Always` invariant holds under disk-init fallback (in-memory
+cap unchanged) and under corrupt/torn-file fault injection (corrupt entries are dropped and decrement
+the byte total, never wedge the queue). The disk-path `Always` is testable on clean disks and is not
+violated by corrupt files; in fallback mode only the in-memory `Always` applies. Workloads targeting
+the *disk* cap must guard against the silent fallback (log-only, no metric).
diff --git a/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md b/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md
new file mode 100644
index 00000000000..2aff8853287
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md
@@ -0,0 +1,159 @@
+# rss-bounded-under-cardinality
+
+**Family:** Resource Boundaries — memory
+**Status:** Verified against code at commit 042f41db3b. Property is expected to **FAIL by design** under default configuration.
+
+## What led to the property
+
+The ADP headline guarantee (Confluence, "What Comes After DogStatsD") is **"ADP will not crash
+under load, losing customer data,"** and the product is marketed on **deterministic resource
+usage** (`docs/agent-data-plane/index.md:1-6`). The most direct reading of that guarantee is:
+under a high-cardinality tag flood (or a single metric with many distinct timestamped values),
+the process RSS stays within the operator's configured memory grant. This property tests that
+runtime claim directly — not the static startup bound.
+
+## Why it is expected to fail (the design gaps)
+
+The runtime memory story is built from several independently-weak mechanisms; under default
+config most are off or advisory:
+
+1. **Memory limiting is DISABLED by default.** `MemoryMode::default() == Disabled`
+   (`lib/saluki-app/src/accounting.rs:37-40`). In `initialize_memory_bounds`
+   (`accounting.rs:149-181`), `Disabled` => `limiter_grant = None` => `MemoryLimiter::noop()`
+   (`accounting.rs:174-178`). A noop limiter's `wait_for_capacity` returns immediately
+   (`lib/resource-accounting/src/limiter.rs:73-77, 83-88`). So out of the box there is **no
+   runtime backpressure at all**. A limit only auto-appears via cgroups when `DOCKER_DD_AGENT`
+   is set (`accounting.rs:108-121`).
+
+2. **Even when enabled, backpressure is advisory and tiny.** `MemoryLimiter::new`
+   (`limiter.rs:42-68`) starts backoff at **95% of limit** (`backoff_threshold = 0.95`,
+   line 47) and caps sleep at **25ms** (`backoff_max = Duration::from_millis(25)`, line 49).
+   The checker thread samples RSS only every **250ms** (`limiter.rs:120`). Under a burst, RSS
+   can blow well past the limit between samples while the worst penalty any cooperating task
+   pays is a 25ms sleep.
+
+3. **Only the source cooperates; the allocating hot path does not.** Backpressure is
+   cooperative — it only throttles tasks that call `wait_for_capacity`. The DSD source calls it
+   once per read loop iteration (`lib/saluki-components/src/sources/dogstatsd/mod.rs:1186`). But
+   the **aggregate transform never references the memory limiter at all** (grep of
+   `transforms/aggregate/mod.rs` for `wait_for_capacity`/`memory_limiter` => no matches). The
+   aggregation `HashMap<Context, AggregatedMetric>` and the string interner both grow under
+   pressure regardless of backoff; throttling the socket read does not stop the map from
+   growing once packets are in flight.
+
+4. **The interner spills to the heap by default => "effectively unlimited."** Context
+   resolution interns into a fixed-size buffer, but on a full interner it falls back to a heap
+   allocation when `allow_heap_allocations` is true (`lib/saluki-context/src/resolver.rs:339-353`).
+   The builder defaults this to `true` (`resolver.rs:258` `unwrap_or(true)`, doc `:186-190`),
+   and the DSD config default `dogstatsd_allow_context_heap_allocs` is also `true`
+   (`sources/dogstatsd/mod.rs:149-151, 402-406`), wired through
+   `sources/dogstatsd/resolver.rs:38,56,64`. The doc string is explicit: heap fallback means
+   the interner memory is **"effectively unlimited"** (`resolver.rs:182-184`).
+
+5. **The declared firm bound is known-incomplete.** `AggregateConfiguration::specify_bounds`
+   carries a TODO (`transforms/aggregate/mod.rs:247-272`) admitting that a single metric with
+   **many distinct timestamped values** is not accounted for — only `aggregate_context_limit`
+   entries of `sizeof(Context)+sizeof(AggregatedMetric)` are summed. A many-distinct-timestamp
+   flood inflates each `MetricValues` `SmallVec` beyond the modeled size. So even the static
+   bound diverges from real RSS under exactly the workload this property injects.
+
+Net: the static `BoundsVerifier` is a **startup assertion, not a runtime invariant**, and the
+runtime mechanism is off-by-default / advisory / non-cooperative on the path that actually
+allocates. RSS staying within grant is not guaranteed under high-cardinality load.
+
+## Failure scenario (Antithesis)
+
+Deploy ADP with `memory_mode: strict` (or `permissive`) and an explicit `memory_limit`, plus a
+load generator (millstone-style) that emits a high-cardinality tag flood and/or a single metric
+name with many distinct timestamped values into DogStatsD. Sample real process RSS. Expect RSS
+to exceed the grant (interner heap spill + unaccounted timestamped values + aggregate map
+growth that the 25ms/250ms advisory backoff cannot arrest). Antithesis timing/scheduling
+exploration makes the 250ms-sample / burst race observable in a way the deterministic
+correctness harness (fixed clock, healthy intake) cannot.
+
+## Suggested assertion (NET-NEW — see existing-assertions.md: NO SDK assertions exist anywhere)
+
+- A workload-side or SUT-side check reading `Querier::resident_set_size()` (the same source the
+  limiter uses, `limiter.rs:44,100-102`): `Always(actual_rss <= grant.effective_limit_bytes() *
+  tolerance)`. Honest framing: this assertion is **expected to fail** under default config and
+  under high-cardinality load even with the limiter on — that failure is the finding.
+- **SUT-side instrumentation beats a workload-only checker here.** A `Sometimes(backoff_applied
+  && rss_still_climbing)` anchored in the limiter, plus an assertion that the aggregate insert
+  path observes capacity, would localize *why* RSS escapes (advisory-only, wrong path
+  cooperating) rather than just that it does. A pure workload RSS probe sees the symptom, not
+  the mechanism.
+
+## Configuration dependencies
+
+- `memory_mode` (default `disabled` => no limiter), `memory_limit` (default unset),
+  `enable_global_limiter` (default true, but moot when mode is disabled),
+  `memory_slop_factor` (default 0.25).
+- `dogstatsd_allow_context_heap_allocs` (default true => unbounded interner spill).
+- `dogstatsd_string_interner_size_bytes` / `dogstatsd_string_interner_size` (interner capacity;
+  default 2 MiB).
+- `aggregate_context_limit` (default 1,000,000) bounds map entries but not per-entry value count.
+
+## Open questions
+
+- What RSS tolerance band over `effective_limit_bytes` is "acceptable"? The 95%-threshold +
+  25ms-backoff + 250ms-sample design implies meaningful overshoot is expected even in the happy
+  case; the assertion threshold determines whether this reports as a real violation or noise.
+  Matters because too-tight a bound makes the property flap; too-loose hides the design gap.
+- With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do the aggregate
+  map and per-context value `SmallVec`s still escape the grant under the many-timestamp flood?
+  This decides whether the property is "fails always" or "fails only with heap spill enabled."
+- Is there any RSS ceiling enforced by the container/cgroup that would OOM-kill ADP before the
+  assertion fires? If so the observable outcome is a process crash (a different, arguably worse,
+  violation of the same guarantee) rather than an RSS-over-grant reading.
+
+## Investigation Log
+
+#### What enables a memory limit by default; does cgroup auto-detection require `DOCKER_DD_AGENT`?
+- **Examined**: `lib/saluki-app/src/accounting.rs`: `MemoryBoundsConfiguration` (55-94),
+  `try_from_config` (101-130, the cgroup auto-detect block at 107-121), `get_initial_grant`
+  (133-138), `initialize_memory_bounds` (149-181); `MemoryMode` default
+  (`memory_mode` serde `default` => `MemoryMode::Disabled`, fields/doc near 90-94);
+  call site `bin/agent-data-plane/src/cli/run.rs:205-206,239`; the `DOCKER_DD_AGENT` reference in
+  `components/apm_onboarding/install_info.rs:90`.
+- **Found — cgroup auto-detect is gated on `DOCKER_DD_AGENT`**: In `try_from_config`, cgroup
+  detection runs **only if** `config.memory_limit.is_none()` AND
+  `env::var("DOCKER_DD_AGENT")` is `Ok` AND its value is non-empty (107-110). Only then does
+  `CgroupMemoryParser.parse()` run and, on success, set `config.memory_limit` (111-118). So with
+  no explicit `memory_limit` config and no non-empty `DOCKER_DD_AGENT`, `memory_limit` stays
+  `None`.
+- **Found — two independent gates make the limiter a no-op by default**:
+  (1) `memory_mode` defaults to `MemoryMode::Disabled`; in `initialize_memory_bounds`, Disabled
+  logs "Memory limiting disabled." and yields `None` grant (158-161). (2) Even in
+  Permissive/Strict, a `None` `memory_limit` logs "No memory limit set ... Skipping memory bounds
+  verification." and yields `None` (167-170). A `None` grant => `MemoryLimiter::noop()` (174-178).
+- **Not found**: Any default config shipping `memory_mode != disabled` or a default
+  `memory_limit`; any auto-detect path that doesn't require `DOCKER_DD_AGENT`.
+- **Conclusion**: RESOLVED. By default there is **no enforced memory limit**: `memory_mode` is
+  `Disabled` and `memory_limit` is unset. A limit becomes active only when both gates open:
+  `memory_mode` is Permissive/Strict AND `memory_limit` is populated, either explicitly or by cgroup
+  parsing (which needs a non-empty `DOCKER_DD_AGENT` and only populates `memory_limit`; it does not
+  change `memory_mode`). For the Antithesis run this means: unless the harness sets a non-Disabled
+  `memory_mode` plus a limit (explicit `memory_limit`, or `DOCKER_DD_AGENT` with a cgroup memory
+  limit), the limiter is a noop and RSS is unbounded by ADP itself, confirming this property FAILs
+  by design under default config. The harness should verify whether its container sets
+  `DOCKER_DD_AGENT`, since that silently changes the baseline.
+
+#### With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do per-value `SmallVec`s still escape? `(partial)`
+
+- **Examined**: `transforms/aggregate/mod.rs` `AggregationState` insert/cap (`:566-571`) and
+  `specify_bounds` (`:247-273`, incl. the self-documented gap comment `:249-256`); the
+  `aggregate-context-limit-enforced` finding that the live context count is bounded to exactly
+  `context_limit`; `MetricValues` storage for multi-value / many-timestamp metrics.
+- **Found**: the *context count* is firmly bounded to `context_limit` (new contexts over the cap are
+  dropped) — confirmed. But the declared firm bound is `context_limit * (sizeof(Context) +
+  sizeof(AggregatedMetric))`, which `specify_bounds` itself admits does **not** account for a single
+  metric carrying many distinct timestamped values (per-value `SmallVec`/sketch-sample growth). So
+  bounding the context count does **not** bound per-value memory.
+- **Not found**: a measured figure for how much per-value memory a single many-timestamp context can
+  consume under `allow_context_heap_allocs=false` — i.e. whether interner-bounded mode actually caps
+  total RSS or merely caps the number of contexts while per-context value vectors grow on the heap.
+  This needs a runtime measurement (a workload feeding one context thousands of distinct timestamps
+  and reading RSS), not a static read.
+- **Conclusion**: `(partial)` — context count is bounded exactly; per-value memory is unaccounted and
+  remains an empirical question for the workload to settle. Does not change the property statement
+  (RSS-within-grant), only sharpens *why* it may still fail even with heap allocs disabled.
diff --git a/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md b/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md
new file mode 100644
index 00000000000..a72e605e938
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md
@@ -0,0 +1,108 @@
+---
+slug: shutdown-drains-no-loss
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Liveness (with a safety boundary at the 30s timeout)
+priority: High
+assertion_status: MISSING (net-new instrumentation)
+---
+
+# Property: Events accepted before the shutdown signal are drained to the forwarder before clean shutdown
+
+## Origin
+SUT analysis §5 safety #6 ("Graceful shutdown completes within 30s without forceful kill"),
+`docs/reference/architecture/index.md` "Shutdown" section, and the §3 note that open aggregate
+windows are dropped on shutdown by default. No Antithesis assertion exists.
+
+## What the code does
+
+### Drain-by-channel-closure design
+`docs/reference/architecture/index.md:196-215`: shutdown signals sources first; sources stop
+intake and finish in-flight work, then signal done. Transforms/destinations are NOT signaled —
+they drain because their input channels close once all upstream senders shut down, then they
+"naturally complete." The doc claims: "we ensure that all remaining events are processed before
+the topology is completely shutdown." This is the drain guarantee under test.
+
+### The 30s grace window + forceful stop
+`bin/agent-data-plane/src/cli/run.rs:289-290`: `running_topology.shutdown_with_timeout(Duration::from_secs(30))`.
+`lib/saluki-core/src/topology/running.rs` `shutdown_with_timeout` (~71-124):
+- `shutdown_coordinator.shutdown()` (~82) triggers source shutdown which cascades downstream.
+- Loops on `component_tasks.join_next_with_id()` until all components stop → `stopped_cleanly`,
+  logs "All components stopped." (~89-97).
+- On `shutdown_timeout` elapse (~110-115): `warn!("Forcefully stopping topology after shutdown grace
+  period.")`, sets `stopped_cleanly = false`, breaks the loop. Components still running are
+  abandoned (the `JoinSet` is dropped) — **in-flight data in not-yet-drained components is lost**.
+- Returns `Ok(())` only if `stopped_cleanly`, else `Err("Topology failed to shutdown cleanly.")` (~119-123).
+
+### Aggregate open-window drop by default
+`lib/saluki-components/src/transforms/aggregate/mod.rs:115-133`: `flush_open_windows`
+(alias `dogstatsd_flush_incomplete_buckets`) **defaults to `false`**. On stop, the current open
+bucket is NOT flushed by default (to avoid double counting across restart). So data in the *current
+open aggregation window* at shutdown is intentionally dropped even on a clean, within-grace shutdown
+— this is a designed loss boundary distinct from the timeout boundary.
+
+## Failure scenario (Antithesis)
+Drive sustained load, then issue the shutdown signal (SIGINT/SIGTERM). Expectation:
+- Every event accepted *before* the signal that has already passed aggregation into a *closed*
+  window/passthrough is forwarded to the mock intake before the process exits cleanly, provided the
+  topology drains within 30s.
+- If the grace window is exceeded (e.g. intake is slow/blocked so the forwarder can't drain), the
+  forceful-stop path is taken and in-flight data is lost — this is the *acceptable* boundary, and the
+  property should assert the *clean* case and merely characterize (not forbid) the timeout case.
+
+## Key observations
+- Two designed loss boundaries make the property conditional, not absolute:
+  1. **Open aggregation window** at shutdown (dropped unless `flush_open_windows=true`).
+  2. **30s timeout exceeded** → forceful stop drops in-flight data.
+  The clean drain claim is: *for events that have exited aggregation into a flushed window (or are
+  passthrough) and given drain completes within 30s, none are lost.*
+- A blocked/slow forwarder (intake down) is exactly what pushes shutdown past 30s — so the
+  forwarder-eventual-delivery and disk-persistence properties interact: with disk persistence on, the
+  shutdown flush persists the retry backlog (no loss); without it, a forceful stop loses it.
+- Backpressure during shutdown: the source stops reading, but already-accepted events in channels must
+  flow through. If any downstream is wedged (e.g. the source-dispatch wedge from
+  source-dispatch-no-misroute), drain stalls and the timeout fires.
+
+## Config deps
+- 30s timeout is hard-coded (`run.rs:290`), not configurable.
+- `aggregate_flush_open_windows` (default false) — toggles whether open-window data is part of the
+  drained set. The assertion's "accepted before signal" set must exclude open-window data unless this
+  is true.
+- Disk persistence (`forwarder_storage_max_size_in_bytes`): determines whether a forceful-stop /
+  blocked-forwarder shutdown loses the retry backlog or persists it.
+
+## Suggested assertion (MISSING — net-new)
+- **Sometimes(clean-drain-no-loss):** at least once, after a shutdown signal under load with a healthy
+  intake, the topology stops cleanly within 30s AND every accepted-before-signal event that reached a
+  flushed window is observed at the mock intake (reconcile input-before-signal vs received). This is
+  the meaningful progress state proving drain works.
+- **AlwaysOrUnreachable(timeout-implies-forceful):** whenever the 30s timeout fires, the run reports
+  forceful stop (`stopped_cleanly=false` / "Forcefully stopping topology" / `shutdown_with_timeout`
+  returns `Err`); i.e. the timeout path is the only way in-flight loss occurs, and it is loudly
+  signaled, never silent. Anchor at `running.rs:110-115`.
+- Do NOT assert an absolute Always-no-loss: the open-window default-drop and the 30s forceful path are
+  designed losses that would make a blanket Always false.
+
+## SUT-side instrumentation needs
+- SDK `assert_reachable` at the clean-stop branch (`running.rs:91` "All components stopped.") to
+  confirm clean shutdown is exercised.
+- SDK `assert_reachable` (characterization, not failure) at the forceful-stop branch
+  (`running.rs:112`) so triage can see when the timeout boundary was hit.
+- Primary check is workload-side reconciliation of the pre-signal input set against the mock intake,
+  excluding open-window data unless `flush_open_windows=true`.
+
+## Open questions
+- **Does the source actually wait for already-read-but-not-yet-dispatched events on shutdown?** The
+  doc says sources "wait for existing work to complete" — confirm the DSD `'read` loop finishes
+  dispatching the current buffer before signaling done, else events accepted just before the signal
+  but still in the source are lost even on a clean shutdown.
+- **Final aggregate flush on stream close:** SUT §5 liveness #1 says aggregate does a final flush on
+  stream close. Confirm that final flush emits *closed* windows (not open) and that those flushed
+  metrics make it through the encoder→forwarder before the 30s deadline under load.
+- **What is the realistic drain time under max load with a healthy intake?** If normal drain
+  approaches 30s, the clean-case assertion is fragile and the timeout boundary becomes the common case
+  — important for sizing the workload and for whether 30s is adequate (a potential finding).
+- **`PassthroughBatcher` / `passthrough_idle_flush_timeout` (1s)**: pre-timestamped metrics buffered there
+  at shutdown — are they flushed on stop or dropped? Affects which "accepted before signal" events are
+  in the drained set.
diff --git a/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md b/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md
new file mode 100644
index 00000000000..733fa2c2fb8
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md
@@ -0,0 +1,107 @@
+---
+slug: source-dispatch-no-misroute
+sut_path: /home/ssm-user/src/saluki
+commit: 042f41db3bd97118c38981765fd49696fce9d318
+updated: 2026-05-28
+type: Safety
+priority: Medium
+assertion_status: MISSING (net-new; likely needs SUT-side Unreachable instrumentation)
+---
+
+# Property: A mid-buffer dispatch failure never mis-routes remaining events across DSD outputs
+
+## Origin
+SUT analysis §7 finding #6 ("Source dispatch errors are logged and swallowed ... a mid-buffer
+dispatch failure can mis-route remaining events (eventd/service-check events leaking into the
+metrics path)"). The TODO at `sources/dogstatsd/mod.rs:1670-1676` explicitly flags this. No
+Antithesis assertion exists.
+
+## What the code does
+`lib/saluki-components/src/sources/dogstatsd/mod.rs` `dispatch_events` (~1667-1716):
+1. TODO (~1670-1676): "if we fail to dispatch the events, we may not have iterated over all of
+   them, so there might still be eventd events when we get to the service checks point, and eventd
+   events and/or service check events when we get to the metrics point."
+2. Eventd path (~1679-1690): if `has_event_type(EventType::EventD)`, `extract(Event::is_eventd)`
+   removes all eventd events from the buffer into an iterator, then `buffered_named("events")`
+   `.expect("events output should always exist")` `.send_all(...)`. On error: logs, returns nothing
+   (function returns `()`), **does not re-insert**.
+3. Service-check path (~1692-1704): same shape with `extract(Event::is_service_check)` →
+   `buffered_named("service_checks").expect(...)`.
+4. Metrics path (~1706-1715): "if there are events left, they'll be metrics" → `dispatch_named("metrics",
+   event_buffer)` with the *remaining* buffer.
+
+### Why the actual misroute is subtler than the TODO implies
+`lib/saluki-core/src/topology/interconnect/event_buffer.rs`:
+- `extract` (~61-88) iterates ALL events, collects matching indices, removes them from the buffer,
+  and **recomputes `seen_event_types` from what remains**. It removes matching events regardless of
+  whether the later send succeeds. So after `extract(is_eventd)` returns, the buffer no longer
+  contains eventd events even before `send_all` runs.
+- `send_all` (`dispatcher.rs:197-206`) consumes the *already-extracted iterator*. If it errors
+  mid-iteration, the events it failed to push are dropped (lost), but they were already removed from
+  `event_buffer`, so they cannot leak into the service_checks or metrics path.
+- Net: with the current `extract`-then-`send_all` ordering, a dispatch *send* failure causes **loss**
+  of the events for that output, NOT misrouting into another output's path. The "leaking into metrics"
+  hazard would require `extract` to leave matching events in the buffer on a send failure — which it
+  does not, because extraction and sending are separate steps.
+
+So the property splits into two distinct claims:
+- **(A) No misroute (Safety):** events of type eventd/service-check never arrive at the `metrics`
+  output, and vice versa. With current code this should hold structurally (extract is by predicate,
+  recomputed types), but the TODO documents authorial uncertainty and `.expect()` on outputs is a
+  crash if an output is ever missing.
+- **(B) No silent loss on dispatch failure (related, overlaps no-silent-interconnect-drop):** a
+  `send_all`/`dispatch_named` error here is logged and the extracted events are dropped, with the
+  TODO noting the component will "continue to fail to dispatch ... until the process is restarted."
+
+## Failure scenario (Antithesis)
+Drive a mixed buffer (eventd + service_check + metric events) while forcing a downstream output to
+error mid-dispatch (e.g. close/saturate the events or service_checks downstream so `send_all`
+errors). Observe at the mock intake / per-output telemetry that:
+- no eventd or service-check payload appears on the metrics encode/forward path (misroute = false);
+- the events that failed to dispatch are accounted as failures, not silently mixed elsewhere.
+
+## Key observations
+- `.expect("events output should always exist")` (~1684) and `.expect("service checks output should
+  always exist")` (~1698) are crash points if those named outputs are ever unwired — a separate
+  liveness/crash hazard on this path.
+- A send error in the eventd step does not early-return: on error the code only logs inside the
+  `if let Err` and falls through to the next `if` block. So after an eventd send failure it still
+  attempts service_checks then metrics with the remaining (eventd-free) buffer. This is the
+  partial-iteration concern, but because eventd events were already extracted, the metrics
+  dispatch gets only non-eventd, non-service-check events.
+
+## Config deps
+- DSD source must have all three named outputs (`metrics`, `events`, `service_checks`) wired — they
+  are in the production blueprint (SUT §2). If a deployment omits one, the `.expect` panics.
+
+## Suggested assertion (MISSING — net-new, SUT-side likely required)
+- **Unreachable(misroute):** an eventd or service-check `Event` reaching the metrics output path, or a
+  metric reaching events/service_checks. Best as an SDK `assert_unreachable` at the point where the
+  metrics dispatch buffer is assembled (`mod.rs:1706-1715`) checking that no remaining event
+  `is_eventd() || is_service_check()`. This directly encodes the misroute-must-never state and would
+  fire if a future refactor breaks `extract`/type-recompute. SUT-side instrumentation needed because
+  the routing decision is internal and not observable from telemetry alone.
+- **AlwaysOrUnreachable(dispatch-failure-counted):** when `send_all`/`dispatch_named` returns Err on
+  this path, a failure/discard counter increments (no silent swallow). Overlaps the
+  no-silent-interconnect-drop property; here scoped to the source dispatch.
+
+## SUT-side instrumentation needs
+- An `assert_unreachable` checking `event_buffer` contents are metrics-only at the metrics dispatch
+  step (`mod.rs:~1707`). The misroute path is not externally observable, so this must be in-process.
+- Optional `assert_unreachable` guarding the two `.expect(...)` output lookups (~1684/1698) to convert
+  the latent panic into a tracked property if an output is missing.
+
+## Open questions
+- **Can `extract` ever leave a matching event in the buffer?** Reading `event_buffer.rs:61-88`,
+  removal is by collected indices via `swap_remove_back`, with a `pos < self.events.len()` guard
+  (~79). `swap_remove_back` reorders, and indices were collected before removal then applied in
+  reverse — confirm no index aliasing leaves a matching event behind (would be the actual misroute
+  bug the TODO fears). This is the crux: if extraction is correct, misroute is structurally
+  impossible; if not, the assertion catches it.
+- **Is the dropped-on-send-failure data counted anywhere?** The `error!` log (~1688/1702/1713) is the
+  only signal; there may be no counter for "events lost because the source could not dispatch them,"
+  unlike the interconnect `events_discarded_total`. If uncounted, the loss is fully silent — worth a
+  finding and a counter.
+- **Does a persistent downstream failure here wedge the source** ("continue to fail ... until the
+  process is restarted")? If so, this also feeds the fail-stop/shutdown story (slug
+  shutdown-drains-no-loss / supervision §2).
diff --git a/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md b/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md
new file mode 100644
index 00000000000..4ddeacfdc62
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md
@@ -0,0 +1,105 @@
+# tag-filterlist-applied-consistently
+
+## Origin
+
+Coverage-gap analysis of the DogStatsD transform chain. `tag_filterlist` removes/retains tags
+per-metric and claims Datadog-Agent equivalence, but has two correctness subtleties no property
+covers: (1) it filters **only Counter and sketch** metrics — gauges/rates/sets are deliberately
+**not** filtered (`mod.rs:235-237`); and (2) it serves results from a **context cache** keyed by the
+original context, which must always agree with a fresh filter computation and must be invalidated on
+config reload. A wrong metric-type predicate, or a stale cache entry, silently retains tags the
+operator intended to drop (cardinality/PII leak) or drops tags that should remain.
+
+## Code paths
+
+- `bin/agent-data-plane/src/components/tag_filterlist/mod.rs`
+  - **Type gate:** `if metric.values().is_sketch() || matches!(metric.values(),
+    MetricValues::Counter(_))` (`mod.rs:235-237`). Only sketches + counters are considered; every
+    other metric type bypasses filtering entirely.
+  - **Filter logic:** `filter_metric_tags` (`mod.rs:299-318`) — looks up `filters.get(name)`; on a
+    rule, `retain_tags` + `retain_origin_tags` with `should_keep_tag` (`mod.rs:289-291`,
+    `is_exclude != names.contains(tag.name())`). Filters **both instrumented and origin tags**.
+  - **Context cache:** `context_cache: Cache<Context, Option<(Context, usize)>>`, capacity 100_000
+    (now operator-tunable via the `aggregator_tag_filter_cache_capacity` config key, PR #1771; eviction
+    just forces recompute, not incorrectness), TTI 30s (`mod.rs:40-42,204-214`). Hit path replaces
+    context with the cached filtered context
+    (`mod.rs:240-247`); miss path computes, then caches `None` (NoChange) or `Some((filtered, n))`
+    (`mod.rs:248-263`).
+  - **Compile/merge rules:** `compile_filters` (`mod.rs:111-140`) — same name+action unions tag sets;
+    conflicting actions → **exclude wins**; empty `metric_name` entries are dropped.
+  - **Reload:** on `watch_for_updates("metric_tag_filterlist")` (`mod.rs:222,274-277`) it rebuilds
+    `self.filters` **and** `self.context_cache = build_context_cache()` (cache invalidated on reload —
+    good, but see `filter-config-reload-correct` for the lag/partial/clear hazards that gate whether
+    the *new* filters are even applied).
+  - Agent reference comments cite `pkg/aggregator/time_sampler.go` equivalence for the sibling
+    post-aggregate filter; the tag filter targets per-metric tag stripping.
+
+## Failure scenario
+
+- **Cache divergence:** the cached filtered context for name X disagrees with a fresh
+  `filter_metric_tags(X)` — e.g. a metric whose tagset differs but collides on the cache key, or a
+  cache entry that survives a config reload window. The cached (stale/wrong) tagset is applied,
+  silently producing a different tag set than the current rules dictate.
+- **Type-gate divergence:** a metric type the Agent *would* filter (or would *not*) is treated
+  differently by ADP's Counter+sketch-only gate, so a tag the operator listed is retained on (say) a
+  gauge that ADP never filters — silent cardinality/PII leak; or dropped on a type the Agent leaves
+  alone.
+- **Merge divergence:** include/exclude conflict for the same name resolves differently than the
+  Agent ("exclude wins" + first-exclude-wins, `mod.rs:127-135`), changing which tags survive.
+
+All silent — only `tag_filterlist_*` telemetry counters move; no error, no drop.
+
+## Property
+
+- **Type:** Safety.
+- **Invariant:**
+  - `Always(cache-hit filtered tags == fresh filter_metric_tags result)` — SUT-side: on a cache hit,
+    recompute and assert the cached context's tags equal the freshly-filtered tags. Catches stale/
+    colliding cache entries.
+  - `Always(only Counter and sketch metrics ever have tags removed by this transform)` /
+    `AlwaysOrUnreachable(a gauge/rate/set leaves tag_filterlist with tags identical to input)` —
+    pins the deliberate type gate so a future refactor that widens/narrows it is caught.
+  - Differential (optional, ride Add-on 2): `Always(post-filter (name, tags) within ratio of Agent
+    time-sampler tag filtering)` for the same `metric_tag_filterlist` config and a corpus spanning all
+    metric types — the strongest equivalence check.
+  - `Sometimes(a tag was removed)` + `Sometimes(cache hit served a filtered result)` for non-vacuity.
+- **Antithesis angle:** sustained mixed-type metric load (counters, sketches, gauges, rates, sets)
+  with overlapping names that stress the context cache (eviction at 100_000 / TTI 30s), interleaved
+  with config reloads (compose with `filter-config-reload-correct`) and node-throttling to widen the
+  reload-vs-apply window. Timing exploration surfaces cache entries that outlive the reload that
+  should have invalidated them.
+- **Priority:** Medium (High if the Counter+sketch type gate is found to diverge from the Agent).
+
+## Config dependencies
+
+- `metric_tag_filterlist` (array of `{metric_name, action: include|exclude, tags:[...]}`); set
+  identically on the Agent baseline for the differential facet.
+- Dynamic config enabled (remote-agent mode) only for the reload-interaction facet; the cache-vs-fresh
+  and type-gate invariants can run with a static config in the primary topology.
+- Corpus must include **all** metric value types to exercise the type gate.
+
+## Open Questions
+
+- Does the Datadog Agent restrict tag filtering to the same Counter+sketch subset, or does it filter
+  other metric types too? Pivotal for the type-gate invariant and the differential facet.
+  `(needs Agent-source confirmation)`
+- The context cache is keyed by the **full original `Context`** (`mod.rs:204`); can two metrics with
+  the same name+tags but different *origin tags* collide and get the wrong filtered result, given
+  filtering also touches origin tags (`mod.rs:308`)? Needs a cache-key audit.
+- Cache TTI is 30s; under a config reload the cache is rebuilt (`mod.rs:276`), but only **when the
+  reload event actually fires** — if the reload is Lagged-dropped (`filter-config-reload-correct`
+  Hazard 1) the cache is *not* rebuilt and stale filtered contexts persist up to TTI. Confirm this
+  compound failure.
+- Does "exclude wins on conflict, first-exclude-wins" (`mod.rs:127-135`) match the Agent's merge
+  semantics for duplicate metric-name entries?
+
+### Investigation Log
+
+- Examined: full `tag_filterlist/mod.rs` incl. the type gate (`235-237`), cache hit/miss/insert
+  (`240-263`), `filter_metric_tags`/`should_keep_tag` (`289-318`), `compile_filters` merge rules
+  (`111-140`), and the reload arm rebuilding both `filters` and `context_cache` (`274-277`).
+- Found: a Counter+sketch-only type gate, a 100k-entry / 30s-TTI context cache on the hot path, and
+  exclude-wins merge — all claiming Agent equivalence with only self-consistent unit tests. The
+  cache-vs-fresh and type-gate facets are SUT-side invariants; the equivalence facet rides the
+  differential harness. Distinct from `filter-config-reload-correct` (which owns the *reload
+  mechanism* hazards) — this property owns the *filter application* correctness.
diff --git a/test/antithesis/scratchbook/properties/topology-ready-before-intake.md b/test/antithesis/scratchbook/properties/topology-ready-before-intake.md
new file mode 100644
index 00000000000..bc423f43551
--- /dev/null
+++ b/test/antithesis/scratchbook/properties/topology-ready-before-intake.md
@@ -0,0 +1,122 @@
+---
+slug: topology-ready-before-intake
+title: Topology becomes ready before data intake begins
+family: Lifecycle Transitions & Configuration
+type: Liveness + Safety (ordering)
+priority: High
+status: assertion-missing
+sut_commit: 042f41db3bd97118c38981765fd49696fce9d318
+---
+
+# topology-ready-before-intake
+
+## Origin
+
+SUT analysis §2 ("Build → spawn lifecycle … The topology starts accepting data only
+after `health_registry.all_ready()`") and §5 Liveness #3 ("Topology starts accepting
+data only after all components report ready"). Owned by the Lifecycle agent.
+
+## Files / functions / lines
+
+- `bin/agent-data-plane/src/cli/run.rs:218-251` — startup ordering:
+  - `run.rs:219-225`: internal supervisor is spawned (`run_with_shutdown`).
+  - `run.rs:227-235`: `select!` waits on `health_registry.all_ready()` for the *internal
+    supervisor* to become healthy before proceeding. If the supervisor task completes first
+    (`early_result`), returns an error (`generic_error!("Internal supervisor completed
+    unexpectedly…")`).
+  - `run.rs:238`: `let built_topology = blueprint.build().await?;` — topology is only **built**
+    after the internal supervisor is ready.
+  - `run.rs:239`: `let mut running_topology = built_topology.spawn(&health_registry, memory_limiter).await?;`
+    — components (incl. the `dsd_in` source that opens listeners) are spawned here.
+  - `run.rs:242-251`: a *second* `all_ready()` wait, spawned as a detached task, logs
+    `topology_ready_ms` and emits startup metrics once the full topology reports ready.
+- `lib/saluki-core/src/health/mod.rs:354-375` — `HealthRegistry::all_ready()` / `check_all_ready()`:
+  resolves only when **every** registered component's `is_ready()` is true
+  (`shared.ready` AND `health != Dead`). Empty registry resolves immediately.
+- `lib/saluki-core/src/health/mod.rs:49-66` — `Health::mark_ready()` flips the per-component
+  `ready` atomic and `notify_waiters()` so `all_ready()` re-checks.
+- `lib/saluki-core/src/topology/built.rs:158-410` — `BuiltTopology::spawn` wires interconnects
+  and spawns each component into a `JoinSet` (`spawn_component`, ~666-687). A source only begins
+  reading sockets after its own task runs and marks itself ready.
+
+## Key observation / honest framing
+
+The ordering guarantee is **subtle and only partial** as literally stated by the SUT analysis.
+What the code actually does:
+
+1. The DSD source's listeners are bound and its accept loop starts when the **source component task**
+   runs (after `spawn()` at run.rs:239), independent of whether *other* components (e.g.
+   `dsd_agg`, `dd_out`) have marked ready.
+2. The data path is gated not by an explicit "do not read until all_ready" check in the source,
+   but by **backpressure**: bounded `mpsc` edges + `memory_limiter.wait_for_capacity().await`
+   (SUT §4). A source that reads before downstream is ready will block on `Dispatcher::send`.
+3. The *observable* "topology ready" signal (`topology_ready_ms` log, startup metrics at
+   run.rs:244-251) fires only after `all_ready()`.
+
+So the precise, defensible property is **ordering of readiness milestones**, not "zero bytes
+read before all_ready". A truthful assertion:
+
+- **Always / ordering:** the `topology_ready_ms` milestone (full `all_ready`) is reached, and the
+  internal-supervisor `all_ready` (run.rs:229) always precedes `blueprint.build()` (run.rs:238).
+  That is, the topology is never **built/spawned** before the internal supervisor reports ready.
+- **Sometimes(all_ready reached):** at least once the full topology reaches `all_ready` (the
+  detached task at run.rs:242 logs `topology_ready_ms`) — proves the readiness path is live.
+
+Overclaiming "no data processed pre-ready" would be **wrong**: the source can read and then block
+on backpressure; data may sit in-flight in channels before downstream `all_ready`. Frame the
+assertion around the readiness-milestone ordering + reaching ready, not byte-level intake gating.
+
+## Failure scenario (Antithesis angle)
+
+- Delay/stall a downstream component's readiness (e.g. forwarder `dd_out` blocked on a slow/dead
+  mock intake during init, or aggregate slow to initialize). Verify the process either reaches
+  `all_ready` eventually (liveness) or stays observably "Waiting for topology to become healthy"
+  without crashing.
+- Fault: internal supervisor child fails to initialize → run.rs:232 `early_result` branch returns
+  error before topology is built. Assert the topology was **never spawned** in that case
+  (Unreachable: "topology spawned after internal-supervisor init failure").
+- Timing: with Antithesis scheduling, confirm `sup_ready_ms` (run.rs:230) is logged before any
+  `topology_ready_ms` (run.rs:245).
+
+## Config dependencies
+
+- `data_plane.enabled` / `dp_config.enabled()` (run.rs:152) — if not enabled, ADP exits before
+  building topology (no readiness milestones at all). Assertions must not fire in that case.
+- Pipelines enabled (`dp_config.*_pipeline_required`, run.rs:414-457) determine which components
+  register with the health registry, i.e. what `all_ready` is gating on.
+- Memory mode (`MemoryMode::default()==Disabled`, SUT §7) affects `memory_limiter`; doesn't change
+  ordering but affects backpressure behavior.
+
+## Assertion (MISSING — net-new instrumentation)
+
+No Antithesis SDK assertions exist (existing-assertions.md). Proposed SUT-side instrumentation in
+`run.rs`:
+- After run.rs:230 (internal supervisor healthy) and before run.rs:238: record a monotonic
+  milestone flag `internal_supervisor_ready`.
+- Inside the detached task at run.rs:243 after `all_ready()`: `assert_reachable!` /
+  `Sometimes("topology reached all_ready")` and `assert_always!`/`Always("internal supervisor
+  ready before topology build", internal_supervisor_ready == true)`.
+- For the negative path: at run.rs:232 (`early_result`) before returning the error, the code never
+  reaches `spawn()`; an `assert_unreachable!` placed *after* `spawn()` keyed on
+  "internal_supervisor_failed_to_initialize" would be hard to express in-process — better tested
+  via workload-side log assertion (`port_listening` should be false / no `topology_ready_ms` log).
+
+## Open questions
+
+- Does the `dsd_in` source mark itself ready *before* or *after* binding its listeners? If listeners
+  bind during `initialize()` (before `mark_ready`), a client could connect pre-ready. Need to read
+  `lib/saluki-components/src/sources/dogstatsd/mod.rs` listener-bind vs mark_ready ordering to know
+  whether "accepting connections before ready" is observable. WHY IT MATTERS: determines if the
+  truthful property is "ordering of milestones" only, or can be strengthened to "no socket bound
+  before ready". WHAT CHANGES: the assertion strength and whether a workload `port_listening` probe
+  pre-ready is a valid falsification.
+- Is there any scenario where a component marks ready, processes data, then a *later*-registered
+  component drops the aggregate `all_ready` back to false? `register_component` can happen while
+  `all_ready` waits (mod.rs:347-353 docstring). WHY IT MATTERS: readiness is not latched; the
+  milestone could flap. Confirm all data components register before the run.rs:243 wait.
+
+## SUT-side instrumentation needs
+
+- Antithesis SDK dependency must be added (none today).
+- A monotonic `internal_supervisor_ready` flag readable at the topology-ready milestone, or rely on
+  log-ordering assertions (`sup_ready_ms` before `topology_ready_ms`).
diff --git a/test/antithesis/scratchbook/property-catalog.md b/test/antithesis/scratchbook/property-catalog.md
new file mode 100644
index 00000000000..bfcf0f7cbae
--- /dev/null
+++ b/test/antithesis/scratchbook/property-catalog.md
@@ -0,0 +1,723 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space — headline guarantees and gap analyses that seed properties.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD
+    why: "ADP will not crash under load, losing customer data" — root guarantee for the memory + data-loss families.
+  - path: https://datadoghq.atlassian.net/browse/DADP
+    why: ADP Jira project for tracked gaps/regressions.
+  - path: https://github.com/DataDog/saluki/pull/1768
+    why: PR review #4393897611 (Copilot) — priority alignments and the aggregate-panic-fixed-upstream update reconciled here.
+---
+
+# Property Catalog: Agent Data Plane (ADP)
+
+35 properties across 7 categories. The system makes one headline guarantee — **"ADP will not
+crash under load, losing customer data"** — which decomposes into the *Memory & Resource Bounds*
+and *Data Integrity & No Silent Loss* families. The remaining categories cover aggregation
+correctness, lifecycle/config, untrusted-input parsing, concurrency, and **transform & enrichment
+correctness** (Category G, added after evaluation — ADP as a *transformer*, not just a transport).
+
+> **Evaluation note (2026-05-28):** a 4-lens portfolio evaluation added 8 properties (G1 events/
+> service-checks; G2 transform-chain + runtime filter config-reload), applied 9 refinements, and
+> escalated one scope bias (traces/APM/logs/OTLP coverage). See `evaluation/synthesis.md`.
+
+**Only bootstrap/workload SDK probes exist so far** (`existing-assertions.md`: 6 call sites — an
+ADP `antithesis_init()` + bootstrap `assert_reachable!` behind the `antithesis` feature, and two
+workload-side `assert_reachable!`/`assert_sometimes!` pairs in the harness). Every `Invariant`
+below is still **net-new** SUT-side instrumentation. Several properties are **expected to fail by
+design** under default config (memory limiter disabled, interner heap-fallback enabled, disk
+persistence off) — these are flagged; they are the highest-value findings, not catalog errors.
+
+Provenance tags `[Fn]` after each slug name the discovery focus that surfaced it:
+`[RB]` resource boundaries, `[DL]` data-loss/recovery, `[AG]` aggregation/sketch,
+`[LC]` lifecycle/config, `[RC]` replay/codec/concurrency, `[WC]` wildcard (from SUT analysis).
+
+---
+
+## Category A — Memory & Resource Bounds
+
+The "deterministic resource usage" / "won't OOM" half of the headline guarantee. Critical finding:
+the bound is a **startup assertion about declared sizes**, not a runtime invariant; the runtime
+limiter is advisory (≤25ms backoff, 250ms sampling, cooperative), disabled by default, and the
+interner spills to the heap by default. This category probes whether RSS is *actually* bounded.
+
+### rss-bounded-under-cardinality — RSS bounded under high cardinality
+> **Status (2026-05-29): WORKLOAD WIRED + ROOT CAUSE REPRO'D** — `parallel_driver_send_dogstatsd`
+> (high-cardinality regime) floods distinct contexts in the Antithesis harness to drive this
+> behavioral bug under a run, and the root cause is reproduced as a unit test in
+> `lib/saluki-context/src/resolver.rs`
+> `tests::bug_default_heap_fallback_makes_context_resolution_unbounded` (default heap fallback ⇒
+> resolution never refuses ⇒ unbounded memory). Not fixed.
+| | |
+|---|---|
+| **Type** | Safety (expected to FAIL by design under default config) |
+| **Property** | Under a high-cardinality tag / many-distinct-timestamp flood, process RSS stays within the configured memory grant. |
+| **Invariant** | `Always(rss <= grant.effective_limit_bytes() * tolerance)`, read from the same `process_memory::Querier` the limiter uses. `Always` fits — RSS-within-grant must hold on every check. SUT-side `Sometimes(backoff_applied && rss_still_climbing)` localizes the mechanism better than a workload RSS probe. |
+| **Antithesis Angle** | 250ms RSS sampling + 25ms max advisory backoff (from 95%) means bursts blow past the limit between samples; only the source cooperates, the aggregate hot path never calls `wait_for_capacity`, and the interner heap-fallback default makes growth effectively unlimited. Scheduling/timing exploration surfaces the burst-vs-sample race the deterministic harness can't. |
+| **Why It Matters** | Directly tests "won't crash under load" / deterministic resource usage; failure means OOM under cardinality floods. |
+| **Priority** | High |
+
+**Open Questions:**
+- Acceptable RSS tolerance band over `effective_limit_bytes` (95%+25ms+250ms implies real overshoot even when healthy).
+- With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do aggregate-map / per-context `SmallVec`s still escape under the many-timestamp flood? `(partial: context count is bounded to context_limit exactly — confirmed; per-value memory still unaccounted)`
+- Would the container OOM-kill ADP before the assertion fires (crash vs. over-grant reading)?
+- _Resolved:_ nothing enables a memory limit by default — `memory_mode` defaults to `Disabled` (limiter is a no-op); cgroup auto-detect requires a non-empty `DOCKER_DD_AGENT` env var AND `memory_limit` unset AND a successful cgroup parse (`accounting.rs:107-121`). Confirms the fails-by-design framing.
+
+### aggregate-context-limit-enforced — Aggregate context limit enforced
+| | |
+|---|---|
+| **Type** | Safety (expected to HOLD) |
+| **Property** | The aggregation map never exceeds `aggregate_context_limit` (default 1,000,000) live contexts; over-cap new contexts are dropped-and-counted; existing contexts always merge. |
+| **Invariant** | `Always(contexts.len() <= context_limit)` — true `Always` (no path grows past the cap). Plus `AlwaysOrUnreachable(existing context never dropped by cap)`, and `Sometimes(context_limit_breached)` / `Sometimes(events_dropped_on_cap)` to prove the boundary is reached (avoid vacuity). |
+| **Antithesis Angle** | Interleave a cardinality flood with flush timing and counter zero-value keep-alives (kept-alive counters still occupy slots) to exercise hitting/clearing the breach flag and re-admitting contexts — timing-sensitive. |
+| **Why It Matters** | This hard, always-on, lock-free cap is the central memory-determinism lever for the aggregator (the one non-advisory runtime bound). |
+| **Priority** | High |
+
+**Open Questions:**
+- _Resolved:_ the true live bound is exactly `context_limit` (not `limit + zero_value_count`). Zero-value keep-alive counters stay as ordinary entries in the single `contexts` map and DO count toward the cap (`mod.rs:568`; test `context_limit_with_zero_value_counters` at `mod.rs:1104-1157`); `len()` drops in the flush removal pass (`mod.rs:703-707`) when entries are empty AND past `counter_expire_secs`. The `Always(len <= context_limit)` target is correct.
+- Caveat (not a question): the cap counts contexts, not bytes — one context with many timestamped values is one entry but unbounded value memory; prose must not overclaim "bounded memory" (see `rss-bounded-under-cardinality`).
+
+### interner-full-bounded — Interner-full bounded vs. heap spill
+> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test (shared with `rss-bounded-under-cardinality`) —
+> `lib/saluki-context/src/resolver.rs` `tests::bug_default_heap_fallback_makes_context_resolution_unbounded`
+> shows the default heap-allowed mode never refuses (unbounded) while heap-disallowed refuses (bounded
+> but lossy). Not fixed.
+| | |
+|---|---|
+| **Type** | Safety (heap-disallowed: HOLDS; heap-allowed default: FAILS the bounded reading) |
+| **Property** | When the fixed-size interner is full and heap allocations are disallowed, context resolution fails deterministically (metric dropped) instead of allocating; with heap allowed (the default), memory is no longer bounded. |
+| **Invariant** | Heap-off: `AlwaysOrUnreachable(interner_full ⇒ metric dropped, no heap alloc)` (rare/optional path). `Sometimes(try_intern == None)` proves exhaustion is reached. Heap-on: `Sometimes(intern_heap_fallback > 0)` proves the unbounded spill path is reachable under default config. SUT-side needed to distinguish interned / inlined / heap-fallback / dropped. |
+| **Antithesis Angle** | Small interner + high-cardinality flood fills the buffer; timing exploration stresses the loom-tested reclamation/tombstone path under concurrent intern-vs-drop. |
+| **Why It Matters** | Interner determinism is the foundation of the context memory bound; the default flag flips ADP into the unbounded branch, voiding the bounded-memory guarantee silently. |
+| **Priority** | High |
+
+**Open Questions:**
+- Does fragmentation make `try_intern` return `None` below nominal byte capacity under churn (earlier spill than the budget implies)?
+- Is the real bound the sum across name + tag interning (they share one interner)?
+- _Resolved:_ `dogstatsd_allow_context_heap_allocs` defaults to **true** (`sources/dogstatsd/mod.rs:149-151`; resolver fallback also true, `resolver.rs:258`); bounded mode (`with_heap_allocations(false)`) appears **only in `#[cfg(test)]`**. So bounded mode is opt-in/test-only and "bounded memory" is aspirational under default config.
+
+### memory-limiter-survives-rss-read-failure — Memory limiter survives RSS read failure
+| | |
+|---|---|
+| **Type** | Safety / fault-tolerance (expected to FAIL by design) |
+| **Property** | If RSS becomes unreadable mid-run, memory protection remains active (or the failure is surfaced) rather than silently freezing. |
+| **Invariant** | `Unreachable("limiter RSS read failed — protection lost")` at the `.expect()` site (`limiter.rs:100-102`) — the panic-the-checker-thread state is a critical failure that must never be observed (today it can be). Fix-dependent alternative: `Sometimes(rss_read_failed_and_surfaced)` + liveness check that `active_backoff` is still being updated. Needs SUT-side instrumentation. |
+| **Antithesis Angle** | Inject `/proc`/RSS read failure mid-run; the damaging race is reads failing *before* RSS crosses threshold, freezing backoff at 0 (fail-open). The bare `std::thread` death doesn't trigger process shutdown — silent. |
+| **Why It Matters** | The limiter is already the only runtime memory mechanism; silently disabling it removes the last guard against OOM under load. |
+| **Priority** | **High** (R9; upgraded — the user confirmed custom `/proc` faults are enabled for the tenant, so the failure state is reachable). Still requires the limiter to be explicitly enabled and a SUT-side assertion, since the bare-thread death is otherwise unobservable. |
+
+**Open Questions:**
+- Can `Querier::resident_set_size()` actually return `None`/error *after* succeeding at startup on the Antithesis Linux target, or only via injected `/proc` corruption? Pivotal for priority.
+- Is the frozen backoff more likely 0 (fail-open) or nonzero (fail-stuck, over-throttle)? Opposite symptoms.
+- Should correct behavior be "keep last-known protection" or "fail loudly and restart" (data components are fail-stop; s6 restarts ADP)? Changes Unreachable(panic) vs. Reachable(clean restart) framing.
+
+### retry-queue-bounded-under-outage — Retry queue bounded under outage
+| | |
+|---|---|
+| **Type** | Safety (byte cap) + Liveness (bound implies counted data loss) |
+| **Property** | During a prolonged intake outage the forwarder retry queue (in-memory + disk) stays within its configured byte caps; overflow drops oldest (counted), never grows unbounded. |
+| **Invariant** | `Always(total_in_memory_bytes <= max_in_memory_bytes)` and the analogous disk-bytes `Always` (eviction loops make these true invariants). `Sometimes(items_dropped > 0)` / `Sometimes(persisted_entries_dropped > 0)` prove saturation is reached and the bound is load-bearing. Optional `Reachable` on the "entry too large to ever fit" branch. |
+| **Antithesis Angle** | Hold mock intake down (refused / black-hole / 5xx / slow) under sustained load until the queue saturates; interleaving stresses the shared circuit-breaker backoff + per-endpoint queues + disk eviction. |
+| **Why It Matters** | Tests "won't crash, won't lose data" at its sharpest tension — memory bounded by design means prolonged outage forces counted-but-real data loss. |
+| **Priority** | High |
+
+**Open Questions:**
+- Assert per-endpoint cap or aggregate `num_endpoints * cap` (+ disk)? Fan-out multiplies the bound; matters for RSS protection.
+- Is there a time-based eviction policy (queue-duration) beyond the byte cap, like the Agent's `queue_duration_capacity`?
+- _Resolved:_ corrupt/torn files are dropped and skipped without wedging the queue, and the byte cap holds across drops (see `disk-persisted-retry-survives-restart`); the disk-init-failure fallback keeps the in-memory byte cap and is surfaced only via an `error!` log.
+
+---
+
+## Category B — Data Integrity & No Silent Loss
+
+The "won't lose customer data" half of the headline. Covers the internal backpressure path, egress
+delivery, crash durability, event routing, and shutdown drain.
+
+### no-silent-interconnect-drop — No silent inter-component drop on a wired edge
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Under sustained load with a slow downstream, a correctly-wired interconnect edge applies backpressure (await) and never silently discards events; backpressure propagates back to the socket. |
+| **Invariant** | `Always(events_discarded_total delta == 0)` for a connected output under load. The discard branch fires only when `senders.is_empty()`, so a wired edge can never discard. Pair with `Sometimes(backpressure engaged)` so the test proves it reached the full-channel state. Not a blanket `Unreachable` on the discard site — disconnected outputs discard legitimately. |
+| **Antithesis Angle** | Throttle the forwarder/intake so encoder→forwarder fills, cascading backpressure up every bounded mpsc edge to the DSD read loop; verify queue-and-await instead of drop. |
+| **Why It Matters** | Directly the "no silent loss" half of the headline guarantee on the hottest internal path. |
+| **Priority** | High |
+
+**Open Questions:**
+- Do any production DSD outputs ever have zero senders (e.g. conditional `dsd_debug_log_out`/`dsd_stats_out`)? If so, scope `Always` to always-wired outputs.
+- _Resolved:_ `interconnect_capacity` default is **128** event buffers (`DEFAULT_INTERCONNECT_CAPACITY`, `topology/mod.rs:37`; every non-source edge is `mpsc::channel(128)`). No ADP override. This sizes the burst before backpressure; tunes the test, not the property.
+- Exclude the non-atomic multi-sender partial-delivery case (a closed channel at teardown, not a full one) from the assertion window.
+
+### forwarder-eventual-delivery — Eventual delivery after transient intake outage
+> **Status (2026-05-29): PARTIALLY WIRED** — `finally_verify_delivery` asserts the fault-free
+> eventual-delivery liveness baseline (`Sometimes(delivered>0)`); the post-outage-recovery facet
+> (inject a 5xx/timeout storm, then heal) is a later iteration.
+| | |
+|---|---|
+| **Type** | Liveness |
+| **Property** | After a transient intake outage (5xx/timeouts/connection resets) clears, every accepted retryable transaction is eventually delivered, provided the retry queue did not overflow. |
+| **Invariant** | `Sometimes(all-accepted-retryable-delivered-after-recovery)`: at least once, post-recovery delivered count equals accepted-retryable count submitted before/during the outage (no overflow). Plus `Reachable` on the `Error::Open` re-enqueue site to prove the breaker engaged. Liveness → assert progress, not an instantaneous invariant. |
+| **Antithesis Angle** | Inject a bounded 5xx/timeout/connection-reset storm, then restore 2xx; circuit-breaker backoff + re-enqueue to low-priority queue must recover the backlog. |
+| **Why It Matters** | Egress data-loss surface; retry path only unit-tested against in-process mocks with a virtual clock. |
+| **Priority** | High |
+
+**Open Questions:**
+- Size the outage shorter than `queue_max_size_bytes` overflow, or exclude oldest-dropped txns; overflow is the intended bounded-memory escape valve (a test-setup constraint, not a property uncertainty).
+- _Resolved:_ production forwarder requests are always `Clone` (`FrozenChunkedBytesBuffer`; `clone_request` returns `Some` unconditionally) → retryable failures always take the `Error::Open` re-enqueue path, never the non-cloneable `Error::Service` drop.
+- _Resolved:_ the circuit-breaker backoff is **per-endpoint** (each `run_endpoint_io_loop` builds its own `Arc<Mutex<State>>`), so one slow endpoint cannot serialize others' recovery.
+
+### disk-persisted-retry-survives-restart — Disk-persisted retry survives kill+restart
+| | |
+|---|---|
+| **Type** | Liveness (with no-duplication + poison-drop safety sub-clauses) |
+| **Property** | With disk persistence enabled, retry-queue transactions survive a process kill+restart and are eventually delivered with no systemic loss or duplication; corrupt entries are dropped, not retried forever, without aborting recovery. |
+| **Invariant** | `Sometimes(persisted-backlog-fully-recovered)`: post-restart delivered set covers the persisted backlog, deduped. `AlwaysOrUnreachable(poison-dropped)`: any corrupt on-disk entry is dropped and recovery proceeds — never infinite retry, never abort. `Reachable(persistence-active)`: the silent in-memory fallback did NOT fire, else the test is vacuous. |
+| **Antithesis Angle** | SIGKILL mid-outage (s6 restarts ADP), restore intake, reconcile delivery; separately inject a corrupted on-disk entry to exercise poison handling and torn-write recovery. |
+| **Why It Matters** | Durability across crash is never tested system-level; delete-before-return and silent fallback are subtle correctness levers. |
+| **Priority** | High |
+
+**Open Questions:**
+- At-most-once window: delete-before-return then crash-before-send loses one in-flight txn — the reconcile must tolerate small (~per-endpoint-concurrency) slack, not assert exact equality.
+- _Resolved:_ a torn/partial write across a crash yields a valid filename + truncated content → `serde_json` fails → entry **dropped** (warn + `entries_dropped++`) on read; the scan `continue`s past any number of bad files, so **one bad file never wedges recovery** and the byte cap is never violated (dropping decrements the total). Note: `push` writes non-atomically straight to the final path (`persisted.rs:184`, despite a stale "temporary file" comment).
+- _Resolved:_ the disk-init-failure → in-memory fallback is surfaced **only as an `error!` log (io.rs:406), no metric** — the workload must log-scrape or `assert_unreachable` at the fallback to keep the durability test non-vacuous; the in-memory byte cap still holds after fallback.
+
+### source-dispatch-no-misroute — DSD source dispatch never mis-routes events
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | A mid-buffer dispatch failure in the DogStatsD source never mis-routes eventd/service-check events into the metrics output path (or vice versa). |
+| **Invariant** | **(R8) Primary facet — silent loss:** `Sometimes(dispatch_failed)` + an SUT-side check that a dispatch failure is **counted** (today the `error!` logs at `mod.rs:1688/1702/1713` may be the only signal — unlike interconnect `events_discarded_total`). **Secondary facet — misroute:** `Unreachable(misroute)` (an eventd/service-check Event reaching the metrics dispatch, or vice versa) — structurally improbable with the current `extract`-then-`send_all` ordering, so it functions as a future-refactor guard, not the live hazard. |
+| **Antithesis Angle** | Mixed eventd+service-check+metric buffer while forcing a downstream output to error mid-`send_all`; verify failures are accounted (not silently dropped) and no cross-output leakage. |
+| **Why It Matters** | The live hazard is **silent, uncounted loss** on dispatch failure (a finding); wrong-stream delivery would additionally corrupt data semantics, and `.expect()` on outputs is a latent crash. |
+| **Priority** | Medium |
+
+**Open Questions:**
+- Can `extract` (`swap_remove_back` + collected indices) ever leave a matching event behind? This is the crux — if extraction is correct, misroute is impossible.
+- Is dispatch-failure loss counted anywhere? The `error!` logs may be the only signal — possibly fully silent (a finding).
+- Does a persistent downstream failure wedge the source ("until the process is restarted"), feeding the fail-stop story?
+
+### shutdown-drains-no-loss — Shutdown drains accepted events to the forwarder
+| | |
+|---|---|
+| **Type** | Liveness (with a safety boundary at the 30s timeout) |
+| **Property** | Every event accepted before the shutdown signal that has reached a flushed window is forwarded before clean shutdown, provided the topology drains within the 30s grace window. |
+| **Invariant** | `Sometimes(clean-drain-no-loss)`: at least once, shutdown completes cleanly within 30s and all pre-signal flushed-window events reach the mock intake. `AlwaysOrUnreachable(timeout-implies-forceful)`: whenever the 30s timeout fires, the run loudly reports forceful stop — in-flight loss is never silent. NOT a blanket Always-no-loss: open-window default-drop and forceful-stop are designed losses. |
+| **Antithesis Angle** | Shutdown under load; combine with a slow/blocked intake to push past 30s and exercise the forceful-stop boundary; with disk persistence on, verify backlog persists instead of being lost. |
+| **Why It Matters** | Graceful-shutdown drain is a stated guarantee; interacts with forwarder/disk-persistence at the timeout boundary. |
+| **Priority** | High |
+
+**Open Questions:**
+- Does the source finish dispatching its current read buffer before signaling done, or are just-accepted events still in the source lost on clean shutdown?
+- Does the final aggregate flush on stream close emit only closed windows, and do they clear the encoder→forwarder before the deadline under load?
+- Realistic drain time at max load with healthy intake — if it approaches 30s, the clean case is fragile and 30s adequacy is itself a finding.
+- Are PassthroughBatcher-buffered (pre-timestamped) metrics flushed on stop or dropped?
+
+### events-sc-no-silent-loss — Events and service-checks delivered without silent loss
+| | |
+|---|---|
+| **Type** | Liveness (with a Safety no-silent-drop clause) |
+| **Property** | Under intake backpressure/outage, the events and service-checks sub-paths apply backpressure (never silently drop on a wired edge) and, after a transient outage clears, every accepted event/SC that did not legitimately overflow is eventually delivered. |
+| **Invariant** | Safety `Always(no silent drop on a wired events/SC edge under load)` + Liveness `Sometimes(all-accepted-events-delivered-after-recovery)` / `Sometimes(all-accepted-SC-delivered-after-recovery)` reconciling `component_events_received_total{message_type}` vs `events_sent`. Two silent-loss sites need SUT-side coverage: the encoder recoverable-error drop (`events/mod.rs:179-194`, undercounted) and the wrong-type swallow (`try_into_eventd`→`Continue`, uncounted). No aggregation stage → ~1:1 accept→deliver (modulo `MAX_EVENTS_PER_PAYLOAD=100`), a cleaner reconcile than metrics. |
+| **Antithesis Angle** | Throttle or take down the mock intake so the always-on `dsd_in.{events,service_checks} → *_enrich → dd_*_encode → dd_out` edges fill; verify queue-and-await, then restore and reconcile. Extends `no-silent-interconnect-drop` / `forwarder-eventual-delivery` to two always-on edges the catalog ignored. |
+| **Why It Matters** | The "won't lose customer data" guarantee on two always-on production paths no other property watches. |
+| **Priority** | High |
+
+**Open Questions:**
+- Is the encoder recoverable-error branch ever hit on healthy intake with well-formed events, or only on oversized single events? (Decides the optional `Always` guard.)
+- Do events/SC retries share `dd_out`'s per-endpoint queue with metrics (cross-stream eviction by a metric flood)?
+- Are events/SC requests `Clone` (retryable failures take the re-enqueue path), as confirmed for metrics?
+- Does `dispatch_events` count anything when `send_all` errors, or is dispatch-time loss fully silent? (ties to `source-dispatch-no-misroute`)
+
+### events-sc-pipeline-reachable — Events and service-check sub-pipelines are actually exercised
+| | |
+|---|---|
+| **Type** | Reachability |
+| **Property** | At least once per run, a well-formed event and a well-formed service-check are parsed/accepted at the source and delivered through the encoder to the intake — so the event/SC safety/liveness properties cannot pass vacuously. |
+| **Invariant** | `Sometimes(event_parsed_and_accepted)` / `Sometimes(service_check_parsed_and_accepted)` (source `component_events_received_total{message_type=events\|service_checks}`) + `Sometimes(event_delivered)` / `Sometimes(service_check_delivered)`. Strengthen to `Reachable` if the workload guarantees ≥1 well-formed event + SC. This is the **R4 anti-vacuity anchor** for `events-sc-no-silent-loss` and `malformed-event-sc-no-crash`. |
+| **Antithesis Angle** | Anti-vacuity anchor for a metrics-dominated workload; also catches a wiring/`EnablePayloads`-default regression that silently removes an always-on path. |
+| **Why It Matters** | The catalog is otherwise metrics-only; events/SC are rare in real traffic, so the new event/SC properties need an explicit reachability obligation to mean anything. |
+| **Priority** | Medium |
+
+**Open Questions:**
+- Anchor delivery on encoder `events_sent`, or on the mock intake receiving the `/api/v1/events_batch` / service-check POST? Intake observation is stronger but needs the mock to distinguish endpoints.
+- One anchor per stream, or four (event/SC × parsed/delivered) to localize a parse-but-not-deliver regression?
+
+---
+
+## Category C — Aggregation & Sketch Correctness
+
+ADP must match the Datadog Agent bit-for-(approximately-)bit. The diff-test suite checks happy-path
+equivalence; these properties target correctness under faults, edge cases, and timing — plus a
+guaranteed-crash clock hazard (`aggregate-clock-skew-stable` forward-jump) the suite cannot reach.
+(The former sub-second-window divide-by-zero crash is now fixed upstream, PR #1772, and survives only
+as a regression tripwire.)
+
+### aggregate-matches-agent — Aggregated output matches the Datadog Agent
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | For the same input stream, ADP's aggregated output (counter→rate via bucket width, half-open `[start,start+width)` buckets, histogram/distribution stats) equals the Datadog Agent's, and equivalence is preserved under faults. |
+| **Invariant** | Workload/harness-side `Always(diff within ratio)` on the normalized `stele` diff per flush window, with faults active; `Sometimes(fault injected during window)`. Differential property — anchored on the `panoramic`/`stele` diff harness, not a single in-process assertion. |
+| **Antithesis Angle** | Diff-test equivalence under delayed/skipped flush, process kill+restart mid-window, and downstream backpressure — faults the deterministic `panoramic` harness cannot inject. |
+| **Why It Matters** | Wrong aggregates are silent, customer-visible data corruption; the headline correctness guarantee. |
+| **Priority** | High |
+
+**Open Questions:**
+- Can `panoramic` survive an ADP restart mid-run, and is `FLUSH_WAIT=32s` enough once faults delay flushes (timing-artifact false diffs)?
+- Is the Agent baseline's bucket width guaranteed identical to ADP's `aggregate_window_duration_seconds`?
+- Are zero-value counters emitted identically by both sides across a skipped flush?
+
+### aggregate-no-panic-any-window — No window duration causes a panic
+> **Status (2026-05-30): FIXED UPSTREAM on main** — the `% 0` panic vector is now structurally
+> impossible. The config key is renamed `aggregate_window_duration_seconds` and deserializes as
+> `NonZeroU64` (`transforms/aggregate/mod.rs:95-98`); `align_to_bucket_start` takes a `NonZeroU64`
+> and divides by `bucket_width_secs.get()` (`:822-823`), so zero/sub-second values fail config
+> parsing instead of reaching the divisor (PR #1772). The earlier repro
+> `tests::bug_sub_second_aggregate_window_panics_on_insert` is therefore **stale** — see the bug
+> ledger (the repro lives in a sibling commit and should be dropped or converted to a passing guard).
+> Property retained as a low-cost regression tripwire.
+| | |
+|---|---|
+| **Type** | Safety / Reachability |
+| **Property** | No `aggregate_window_duration_seconds` value causes a panic; the divisor is never zero. |
+| **Invariant** | `Unreachable("align_to_bucket_start reached with bucket_width_secs == 0")` as a regression tripwire — should never fire now that the type is `NonZeroU64`. Fires only if a future refactor reintroduces a zero-able window or a finer-grained (sub-second) divisor. |
+| **Antithesis Angle** | Cheap: explore the (now `NonZeroU64`) config space and confirm no divisor-zero path is reachable. Primary value is guarding against a regression, not finding the original (closed) bug. |
+| **Why It Matters** | The original guaranteed crash + restart loop is closed; the tripwire keeps it closed. |
+| **Priority** | Low (R-2026-05-30: demoted from High — panic vector closed upstream; retained as a regression guard). |
+
+**Open Questions:**
+- _Resolved (2026-05-30):_ the fix landed as a type change (`NonZeroU64`), not runtime clamping, so zero/sub-second windows are rejected at config load.
+- Can the gRPC dynamic-config stream push `aggregate_window_duration_seconds` at runtime? Even so, any value that is not a valid `NonZeroU64` (for example zero) fails to deserialize, so there is no live divisor-zero vector.
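+
+The type-level fix described in the status note can be sketched in a few lines. This is a minimal
+illustration, not the code in `transforms/aggregate/mod.rs`: the body of `align_to_bucket_start`
+and the `parse_window_secs` helper are assumptions made for the example. Only the `NonZeroU64`
+parameter and the `.get()` divisor come from the note above.
+
+```rust
+use std::num::NonZeroU64;
+
+/// Hypothetical stand-in for config parsing: zero is rejected before it can become a divisor.
+fn parse_window_secs(raw: u64) -> Option<NonZeroU64> {
+    NonZeroU64::new(raw)
+}
+
+/// Assumed shape of the alignment step: round a Unix timestamp down to its bucket start.
+/// `bucket_width_secs.get()` is never zero, so the `%` below cannot panic.
+fn align_to_bucket_start(timestamp: u64, bucket_width_secs: NonZeroU64) -> u64 {
+    timestamp - (timestamp % bucket_width_secs.get())
+}
+```
+
+Because the zero case is excluded by the type, the `Unreachable` tripwire only has to guard against
+a future signature change rather than against a runtime value.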
+
+### aggregate-clock-skew-stable — Aggregation stays sane across wall-clock skew
+> **Status (2026-05-29): BUG DEMONSTRATED** (forward-jump facet) as a unit test —
+> `lib/saluki-components/src/transforms/aggregate/mod.rs`
+> `tests::bug_forward_clock_jump_floods_zero_value_points`. A forward wall-clock jump makes `flush`
+> build `zero_value_buckets` over the whole jumped interval (O(jump) work/alloc) and flood one idle
+> counter with points proportional to the jump. Backward-jump gap facet not yet repro'd. Not fixed.
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | A backward/forward wall-clock jump never floods zero-value points nor silently breaks counter continuity; bucketing stays bounded and well-formed. |
+| **Invariant** | `Always(zero_value_buckets.len() <= ceil(flush_interval/window)+slack)` and `Always(current_time >= last_flush)` inside `flush`; `Sometimes(clock jumped during flush)`. CONFIRMED two-clock hazard: wall-clock bucketing vs monotonic flush cadence, no monotonicity guard; forward jump floods the zero-value loop, backward jump empties it. |
+| **Antithesis Angle** | Clock fault injection (NTP step backward/forward) during a steady counter stream. |
+| **Why It Matters** | Metric flood + memory spike (forward) or silent continuity gap and mis-expiry (backward); diverges from Agent. |
+| **Priority** | High |
+
+**Open Questions:**
+- Fix: monotonic source vs clamp `current_time = max(current_time, last_flush)` plus a loop cap? Decides `Unreachable(flood)` vs `Always(bounded)`.
+- Any guard against `get_unix_timestamp()` returning 0 (pre-epoch via `unwrap_or_default`)? None found.
+- Does the Agent behave identically under the same step (ties to `aggregate-matches-agent`)?
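+
+The "clamp plus loop cap" option from the first open question could look roughly like the sketch
+below. It is one possible shape under stated assumptions, not the aggregate transform's `flush`:
+the function name, its parameters, and the `max_buckets` cap are all hypothetical.
+
+```rust
+use std::num::NonZeroU64;
+
+/// Hypothetical helper: bucket starts that need a zero-value point between two flushes.
+/// A backward clock step is clamped to `last_flush`; a forward step is bounded by `max_buckets`.
+fn zero_value_bucket_starts(
+    last_flush: u64,
+    current_time: u64,
+    width: NonZeroU64,
+    max_buckets: usize,
+) -> Vec<u64> {
+    // Clamp: never treat "now" as earlier than the previous flush.
+    let now = current_time.max(last_flush);
+    let w = width.get();
+    let mut start = last_flush - (last_flush % w);
+    let mut starts = Vec::new();
+    // Cap: work and allocation stay O(max_buckets) however far the clock jumped.
+    while start < now && starts.len() < max_buckets {
+        starts.push(start);
+        start = match start.checked_add(w) {
+            Some(next) => next,
+            None => break,
+        };
+    }
+    starts
+}
+```
+
+With this shape the invariant `Always(zero_value_buckets.len() <= bound)` holds by construction. The
+cap keeps the oldest buckets and drops the rest, which is itself a policy choice to confirm against
+the Agent.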
+
+### ddsketch-bin-count-bounded — Bin count never exceeds bin_limit
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | After any inserts / multi-weight inserts / interpolations / merges, an agent `DDSketch`'s bin count never exceeds `bin_limit` (4096). |
+| **Invariant** | `Always(self.bins.len() <= bin_limit)` after every mutating method (post-`trim_left`); `Reachable("trim_left collapsed bins")`. CONFIRMED enforced today by `trim_left` at every mutation site; already has unit + proptests. SUT-side. |
+| **Antithesis Angle** | Live regression tripwire on real sketches after arbitrary interleavings (histogram→distribution, cross-window merges) — catches a new mutator that forgets `trim_left`. The `Reachable(trim_left collapsed bins)` anchor is **essential** or the `Always` is vacuous (real corpora rarely exceed 4096 keys). |
+| **Why It Matters** | Bin explosion → memory blowup and historically an encoder panic. |
+| **Priority** | Medium (R6: demoted from High — substantially duplicates existing proptests; the unique value is a live tripwire for a future `trim_left`-forgetting mutator). |
+
+**Open Questions:**
+- Is test-only `insert_raw_bin` (bypasses `trim_left`) truly unreachable in release?
+- _Resolved:_ only the **agent sketch (4096)** is on the live path; `ddsketch::DDSketch` re-exports the agent impl (`lib.rs:56`). The canonical sketch (2048 bins) has no non-test usage. Assert against the agent `bin_limit` (4096).
+
+### ddsketch-relative-error-bound — Quantile accuracy + merge associativity
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | For in-range values, the sketch's quantile estimates are within configured relative error (eps≈0.78%) and merges are associative/commutative — a **library invariant** of the agent DDSketch, since ADP does not query quantiles on the live path (see below). |
+| **Invariant** | Harness/SUT-side `Always(\|q_est - v\| <= eps_rel*\|v\|)` for in-range inputs and `Always(merge order-independent within tolerance)`, exercised against the agent sketch directly (not a production runtime call). The lower-value live facet — faithful bin serialization + bin count ≤ 4096 — is already covered by `ddsketch-bin-count-bounded`. eps=1/128, gamma=1+2·eps confirmed; `avg`/`sum` in `merge` are f64 order-sensitive. |
+| **Antithesis Angle** | Merge order under interleaving (delayed flush / backpressure reordering) and accuracy at the representable-range boundary — invisible to the single-order diff test. |
+| **Why It Matters** | Quantile/avg drift is silent wrong customer data. |
+| **Priority** | Low (R5: demoted Medium→Low — the accuracy guarantee is a library/proptest invariant, not a live ADP runtime invariant; ADP never calls `DDSketch::quantile` on the customer path). |
+
+**Open Questions:**
+- Acceptable f64 tolerance for `avg`/`sum` under reordered merges vs diff-test's 1e-8?
+- _Resolved:_ ADP does **not** call `DDSketch::quantile` on the live customer path — histogram percentiles use raw-sample `HistogramSummary::quantile`; distribution sketches are serialized as raw bins and quantiled server-side. Only the agent sketch is live. So this property is a library/harness-side invariant, not a production runtime assertion.
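+
+The order sensitivity of `sum` under reordered merges is ordinary f64 non-associativity. The sketch
+below shows it with plain sums and a relative-tolerance comparison of the kind the harness would
+need. Both helper names are hypothetical and nothing here calls into the `ddsketch` crate.
+
+```rust
+/// Accumulate in the given order, the way a chain of merges would accumulate `sum`.
+fn fold_sum(values: &[f64]) -> f64 {
+    values.iter().fold(0.0, |acc, v| acc + v)
+}
+
+/// Relative comparison scaled by the larger magnitude; equal values always pass.
+fn within_rel_tolerance(a: f64, b: f64, rel_tol: f64) -> bool {
+    let scale = a.abs().max(b.abs());
+    (a - b).abs() <= rel_tol * scale
+}
+```
+
+Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` gives results that differ in the last bit, so an
+exact-equality assertion on merged `sum`/`avg` would produce false failures, while a `1e-8` relative
+tolerance accepts them.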
+
+### ddsketch-no-nan-poison — NaN never silently poisons sum/avg
+> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test —
+> `lib/ddsketch/src/agent/sketch.rs` `tests::bug_nan_sample_poisons_sum_and_avg`. A single NaN sample
+> permanently poisons `sum`/`avg` (sticky) while `count`/`min`/`max` stay valid (silent corruption);
+> no finiteness guard at the sketch boundary. Not fixed (demonstration only).
+| | |
+|---|---|
+| **Type** | Safety / Reachability |
+| **Property** | A single NaN sample must never silently poison a sketch's `sum`/`avg`; for finite input, sum/avg stay finite, and the boundary rejects/sanitizes non-finite values. |
+| **Invariant** | `Always(v.is_finite())` at `adjust_basic_stats` entry (or `Unreachable` on absorbed-NaN path) + backstop `Always(self.sum.is_finite())`. CONFIRMED no finiteness guard in `adjust_basic_stats`; `sum += v*n` makes NaN sticky; `key(NaN)` still yields a valid bin; `insert_n` called directly from aggregate. Guarded only per-source. SUT-side. |
+| **Antithesis Angle** | **A confirmed LIVE bypass:** a `checks_ipc` Histogram metric carrying a NaN value (`checks_ipc/mod.rs:195`, no finiteness check) routes `checks_ipc_in.metrics → metrics_enrich → dd_metrics_encode`, bypassing both the DSD `FloatIter` and the aggregate transform, and reaches `ddsketch.insert_n(...)` in the Datadog metrics encoder (`encoders/datadog/metrics/mod.rs:1054`) → `adjust_basic_stats` → `sum += NaN`. Poison then propagates through `merge`. |
+| **Why It Matters** | Permanent silent corruption of sum/avg across sketches and downstream — and it is reachable today, not hypothetical. |
+| **Priority** | High |
+
+**Open Questions:**
+- Fix policy: reject/skip (match codec drop) vs clamp; confirm against Agent baseline.
+- _Resolved:_ the hazard is **LIVE** via `checks_ipc` Histogram metrics (above). OTLP is closed (number values filtered by `is_skippable`; histogram sketches reconstruct finite bounds) and the DSD `aggregate insert_n` path is closed (DSD-only, finiteness-filtered upstream). The robust fix/assertion belongs at the **sketch boundary** (`adjust_basic_stats`/`insert*`), justified by the concrete checks_ipc bypass. Target sum/avg (not `quantile`, which has its own NaN fallback); ±Inf is in scope (`is_finite` covers both).
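+
+The sticky-NaN mechanism and a boundary guard can be shown on a reduced model. `BasicStats` below is
+a hypothetical stand-in for the sketch's basic statistics, not the `DDSketch` type. The unguarded
+method mirrors the `sum += v*n` update named above, and the guarded one illustrates the
+reject/skip policy, which is still an open question.
+
+```rust
+#[derive(Debug, Default)]
+struct BasicStats {
+    count: u64,
+    sum: f64,
+}
+
+impl BasicStats {
+    /// Mirrors the unguarded update: one NaN makes `sum` NaN for good.
+    fn insert_unguarded(&mut self, v: f64, n: u64) {
+        self.count += n;
+        self.sum += v * n as f64;
+    }
+
+    /// Assumed fix shape: refuse NaN and ±Inf at the boundary and report it to the caller.
+    fn insert_guarded(&mut self, v: f64, n: u64) -> bool {
+        if !v.is_finite() {
+            return false;
+        }
+        self.insert_unguarded(v, n);
+        true
+    }
+
+    fn avg(&self) -> f64 {
+        if self.count == 0 {
+            0.0
+        } else {
+            self.sum / self.count as f64
+        }
+    }
+}
+```
+
+After one unguarded NaN, `count` keeps advancing normally while `sum` and `avg` stay NaN for every
+later finite insert, which is the silent corruption the backstop `Always(self.sum.is_finite())` is
+meant to catch.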
+
+---
+
+## Category D — Lifecycle & Configuration
+
+Startup ordering, the no-timeout config-stream wait, fail-fast on incompatible config, bounded
+graceful shutdown, and the fail-stop model (data components are not restarted).
+
+### topology-ready-before-intake — Topology becomes ready before data intake begins
+| | |
+|---|---|
+| **Type** | Liveness + Safety (ordering) |
+| **Property** | The internal supervisor reaches `all_ready` before the data topology is built/spawned, and the full topology eventually reaches `all_ready`. |
+| **Invariant** | `Always(internal-supervisor-ready precedes blueprint.build())` + `Sometimes(full topology reaches all_ready)`. Ordering of readiness milestones is the defensible guarantee. |
+| **Antithesis Angle** | Stall a downstream component's readiness (forwarder blocked on dead intake at init) or fail an internal-supervisor child to init; verify build/spawn never precedes supervisor-ready. |
+| **Why It Matters** | If the topology spawned before dependencies were ready, sources could read into an unwired/uninitialized pipeline. |
+| **Priority** | High |
+
+**Open Questions:**
+- Does `dsd_in` bind/accept on its listeners *before* `mark_ready`? Determines whether the property strengthens from "milestone ordering" to "no socket bound pre-ready."
+- Readiness is not latched; a late-registered component could drop `all_ready` back to false. Confirm all data components register before the wait.
+
+### config-stall-no-deadlock — Config-stream stall does not deadlock or busy-loop startup
+> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test —
+> `lib/saluki-config/src/lib.rs` `tests::bug_config_ready_hangs_forever_without_snapshot`. With
+> dynamic config enabled and the sender held open but silent, `GenericConfiguration::ready()` never
+> resolves (no internal timeout) → ADP startup would hang forever. Not fixed.
+| | |
+|---|---|
+| **Type** | Liveness |
+| **Property** | When the Core Agent config stream is delayed/dropped/erroring, ADP either progresses to "Initial configuration received" or remains quiescently blocked at the wait — never crashing or busy-looping. |
+| **Invariant** | `Sometimes("config received")` + `Reachable("config wait entered")`; no panic, no busy-loop (workload-side CPU/log-rate check on the quiescent hang). CONFIRMED no timeout on `ready()` nor bootstrap registration await. |
+| **Antithesis Angle** | Drop the snapshot (registered but never streamed) → **quiescent indefinite hang** at `ready().await` (no timeout) — the primary falsification target; flap stream → 5s reconnect. |
+| **Why It Matters** | ADP assumes Core Agent reachability; a true indefinite hang with no timeout is operationally surprising (ADP never starts the pipeline, no diagnostic deadline). |
+| **Priority** | High |
+
+**Open Questions:**
+- _Resolved (busy-loop hazard downgraded):_ a steadily-erroring stream does NOT busy-loop. The tonic `Streaming` yields at most one `Err` per stream (initial error fuses to `Terminated`; mid-stream error yields once then `None`), so the loop logs once, `continue`s once, then exits to the **5s sleep** (`remote_agent.rs:302`). No unbounded spin.
+- _Resolved:_ `init_reg_rx.await` is always eventually resolved — `session_id` always starts empty so the first tick takes the register branch, which always sends Ok/Err on `initial_registration_tx`.
+- Confirmed real gap: `GenericConfiguration::ready()` (`lib.rs:694-704`) and the bootstrap registration await have **no timeout** — an open-but-silent stream hangs startup forever by design.
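+
+For contrast with the unbounded wait, the sketch below shows what a deadline on the initial-snapshot
+wait would look like. It uses a blocking `std::sync::mpsc` channel purely as a stand-in: the real
+`ready()` is async, and the names `wait_ready_bounded` and `ReadyOutcome` are hypothetical.
+
+```rust
+use std::sync::mpsc::{Receiver, RecvTimeoutError};
+use std::time::Duration;
+
+#[derive(Debug, PartialEq)]
+enum ReadyOutcome {
+    Received,
+    TimedOut,
+    SenderGone,
+}
+
+/// Wait for the first snapshot signal, but give up after `deadline`.
+/// An open-but-silent sender yields `TimedOut` instead of blocking forever.
+fn wait_ready_bounded(snapshot_rx: &Receiver<()>, deadline: Duration) -> ReadyOutcome {
+    match snapshot_rx.recv_timeout(deadline) {
+        Ok(()) => ReadyOutcome::Received,
+        Err(RecvTimeoutError::Timeout) => ReadyOutcome::TimedOut,
+        Err(RecvTimeoutError::Disconnected) => ReadyOutcome::SenderGone,
+    }
+}
+```
+
+The three outcomes map onto the cases the property distinguishes: config received, a quiescent stall
+with the sender held open, and a dropped stream. Whether a deadline is desirable at all is a design
+decision for the maintainers.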
+
+### config-incompatible-refuses-start — High-severity incompatible config refuses to start
+| | |
+|---|---|
+| **Type** | Safety (Reachability / Unreachable) |
+| **Property** | ADP never spawns the data pipeline when a high-severity-incompatible non-default config key is present; it exits 1 instead. |
+| **Invariant** | `Unreachable("pipeline spawned with high-severity incompatible non-default key")` + `Reachable("ADP refused to start")`. The gate `check_and_warn_config` runs **before** create_topology/build/spawn; its `Err` → `exit(1)`. |
+| **Antithesis Angle** | Inject a current `Severity::High` non-default key; expect exit 1, no listener, no data. Negative controls: same key at default → starts; Medium/Low/Partial → starts. |
+| **Why It Matters** | Running with an incompatible setting risks wrong aggregates / silent data corruption — fail-fast is the safety stance. |
+| **Priority** | Medium (R7: demoted from High — a deterministic ordered gate already covered by the integration suite's config-check-exit-code cases; the `Unreachable` is statically unreachable, so the real artifact is the `Reachable(refused)` exploration). |
+
+**Open Questions:**
+- Confirm env-var overrides are visible to the classifier at the gate.
+- _Resolved:_ partial config updates over the stream are NOT re-validated (the gate runs once at startup, `run.rs:157`; `ConfigClassifier` is referenced only in `run.rs`). This property is correctly scoped to **startup**; the runtime gap is now tracked as its own property `config-runtime-update-not-revalidated`.
+
+### config-runtime-update-not-revalidated — Runtime config updates bypass the incompatibility gate
+| | |
+|---|---|
+| **Type** | Safety (Reachability / scope gap) |
+| **Property** | A high-severity-incompatible non-default config key delivered over the runtime config stream is never applied silently — or, if startup-only gating is intentional, the unguarded runtime-apply path is at least documented and observable. |
+| **Invariant** | `Unreachable("pipeline running with high-severity incompatible non-default key after a runtime config update")`, or `Reachable` on the unguarded runtime-apply path if the design is startup-only. The startup gate (`check_and_warn_config`, `run.rs:157`) is the only classifier callsite; runtime `Partial`/`Snapshot` updates are applied without re-validation. |
+| **Antithesis Angle** | Start ADP clean (passes the gate), then inject a config-stream update carrying a high-severity-incompatible non-default key; observe whether ADP refuses/flags or silently applies it. Exercises the control-plane → data-plane config path the diff-test never touches. |
+| **Why It Matters** | Running with an incompatible setting risks wrong aggregates / silent data corruption — the exact outcome the startup gate prevents. The protection has a runtime hole. |
+| **Priority** | Medium |
+
+**Open Questions:**
+- Is startup-only gating intentional (runtime updates trusted as authoritative from the Core Agent) or an oversight? `(needs human input)`
+- Can a `Severity::High` key actually be delivered over the config stream, or does the Core Agent pre-filter what it sends to remote agents? Determines real-world reachability.
+
+### graceful-shutdown-within-30s — Graceful shutdown completes within 30s without forceful kill
+| | |
+|---|---|
+| **Type** | Liveness (bounded-time) + Reachability |
+| **Property** | On SIGINT (or unexpected component finish) under bounded in-flight load, the data topology stops cleanly within the 30s grace window without the forceful-stop path. |
+| **Invariant** | `Sometimes(clean topology shutdown completed)` + `Reachable(forceful-stop path under adversarial load)`. Distinct from `shutdown-drains-no-loss` (which owns *what data survives*); this owns *clean completion in time*. |
+| **Antithesis Angle** | SIGINT under bounded load → clean exit; SIGINT with forwarder wedged on dead intake → forced-stop after 30s → exit 1; schedule a component to finish at the 30s boundary. |
+| **Why It Matters** | Forceful kill drops in-flight data and risks state corruption; bounded clean shutdown is the operational contract. |
+| **Priority** | High |
+
+**Open Questions:**
+- The **internal supervisor** shutdown has **no timeout** — the process could exceed 30s even when the topology met it. Scope assertions to topology-shutdown completion, or file a separate property.
+- On forceful stop the `JoinSet` is dropped, aborting tasks; confirm no shared-state corruption (overlaps data-loss family).
+- Confirm the `shutdown_coordinator` cascade reliably reaches every component.
+
+### data-component-failure-triggers-process-shutdown — Data component finish triggers whole-process shutdown
+| | |
+|---|---|
+| **Type** | Safety (Always) + Reachability |
+| **Property** | Because data components have no restarting supervisor, any data component finishing unexpectedly deterministically triggers whole-topology shutdown — never a silent half-running pipeline. |
+| **Invariant** | `Reachable(unexpected-finish → shutdown path)` + temporal `Always(component-death always followed by process exit)`. Data topology uses a plain `JoinSet` with no restart; supervisor restart is internal-only. |
+| **Antithesis Angle** | Induce a component panic (hot-path `.expect`/`unreachable!`; note the former sub-second `aggregate_window_duration` panic vector is closed upstream) or a clean early finish; verify fail-stop fires and the process exits (s6 restarts it). |
+| **Why It Matters** | A half-running pipeline silently drops/mis-routes data while appearing alive; fail-stop + full-process restart is the recovery model. |
+| **Priority** | High |
+
+**Open Questions:**
+- Brief gap between spawn and the `select!`: a component dying there is still buffered by the `JoinSet` and caught once the select polls — confirm safe.
+- `expect("no components to wait for")` panics on an empty topology; confirm a spawned topology is always non-empty (guarded by `data_pipelines_enabled`). Low priority.
+
+---
+
+## Category E — Untrusted Input Parsing
+
+The DogStatsD codec and the new (zero-suite-coverage) capture/replay reader parse untrusted bytes.
+Antithesis treats malformed input as a first-class fault dimension.
+
+### malformed-dsd-no-crash — Malformed DSD packets never crash process/socket
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Malformed/adversarial DogStatsD packets on any listener never crash the process or kill a connectionless socket; bad packets are skipped/cleared-and-continued. |
+| **Invariant** | `Always(process up)` + `Always(connectionless socket survives a bad packet)` + `Unreachable` at codec panic sites (`unreachable!`/`from_utf8_unchecked`); `Sometimes(framing_errors>0)`, `Sometimes(*_decode_failed>0)`. |
+| **Antithesis Angle** | Untrusted packet input across UDP/TCP/UDS-dgram/UDS-stream; oversized / invalid-UTF8 / truncated-extension / NUL / huge-tag packets exploring codec + framing error paths. |
+| **Why It Matters** | Clear-and-continue is the socket-survival mechanism; codec error policy is undecided (4 TODOs). Only a non-exhaustive proptest covers this today. |
+| **Priority** | High |
+
+**Open Questions:**
+- Codec error policy for unknown/trailing chunks is undecided (silently permissive); resolving it changes expected-drop accounting, not no-crash.
+- Does a malformed packet ever cause partial dispatch / mis-routing (ties to `source-dispatch-no-misroute`)?
+- TCP oversized frame `break`s the connection — verify no per-connection resource leak.
+
+### replay-no-panic-on-malformed-capture — Replay never panics on malformed capture
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Parsing an arbitrary/corrupt/truncated/zstd DogStatsD capture file never panics or aborts the replay process. |
+| **Invariant** | `Unreachable` at any panic/abort in the reader or its zstd/prost decode calls; pair with `Reachable(replay-parse-executed)`. A panic on untrusted input is a crash, invisible to a workload that already expects a non-zero exit — needs SUT-side instrumentation. |
+| **Antithesis Angle** | Untrusted file input + adversarial bytes / zstd-bomb / protobuf-recursion; pure input exploration of an unfuzzed, zero-suite-coverage path. |
+| **Why It Matters** | Newest/largest ADP feature (+1765 LOC, validated only with `cargo check`); reader parses untrusted files inside an ADP process. |
+| **Priority** | High |
+
+**Open Questions:**
+- _Resolved:_ replay runs as a **separate `agent-data-plane dogstatsd replay` CLI process** (`dogstatsd.rs:394`) that forwards records to the running ADP over the DSD UDS socket — so the panic-catch / SUT assertion belongs in the replay CLI process, not the data-plane process.
+- _Resolved (now first-class hazards, not just questions):_ `reader.rs:40-41` `fs::read` loads the whole file with **no size guard** (OOM vector), and `reader.rs:44` `zstd::stream::decode_all` runs on untrusted input with **no decompressed-size cap** (decompression-bomb vector → unbounded memory on a valid-but-huge stream). Both overlap the memory family and are strong workload inputs.
+
+### replay-corruption-not-silent-eof — Corruption distinguishable from clean EOF
+> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test —
+> `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs`
+> `tests::bug_corrupt_length_prefix_silently_drops_following_records`. A corrupt/oversized length
+> prefix is treated as clean EOF, silently dropping all following well-formed records. Not fixed
+> (demonstration only).
+| | |
+|---|---|
+| **Type** | Safety (data fidelity) |
+| **Property** | A corrupt/oversized record length prefix is detectable as truncation, not silently reported as a clean replay completion. |
+| **Invariant** | `AlwaysOrUnreachable(faithful completion)`: when `read_next` terminates with `Ok(None)`, the offset reached the real trailer separator, not an overrunning/mid-stream prefix; `Sometimes(corruption-detected)`. Honestly framed: code intentionally returns `Ok(None)` today (tests assert it). |
+| **Antithesis Angle** | Untrusted file input; flip/zero a length prefix mid-stream and observe the tool report success having sent only N of M records. |
+| **Why It Matters** | The reader collapses legitimate-EOF, truncation, and corruption into one `Ok(None)`; the driver stops sending silently → false replay-fidelity confidence. |
+| **Priority** | Medium |
+
+**Open Questions:**
+- No record-count/total-length field exists — distinguishing truncation from clean EOF may need a format change; determines strict `Always` vs. heuristic.
+- Maintainers may consider silent truncation acceptable (best-effort tool); the asserting tests suggest "accepted." `(needs human input)`
+- A wrong-but-small `size` could decode garbage as a valid record — a third outcome (silent wrong record, not just truncation).
+
+### malformed-event-sc-no-crash — Malformed event / service-check payloads never crash
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Adversarial/malformed DogStatsD event and service-check payloads on any listener never crash the process or kill a connectionless socket; bad frames are counted and skipped. |
+| **Invariant** | `Always(process up)` + `Always(connectionless socket survives a bad event/SC packet)` + SUT-side `Unreachable` at any panic site in `parse_dogstatsd_event` / `parse_dogstatsd_service_check` and shared `helpers::*`; `Sometimes(event_decode_failed>0)`, `Sometimes(service_check_decode_failed>0)` as anchors. Extends `malformed-dsd-no-crash` to the two separate ~394/~312-LOC codecs (per R1, the no-crash check is SUT-side `Unreachable`, not container liveness). |
+| **Antithesis Angle** | Untrusted event/SC frames across UDP/TCP/UDS: pathological `_e{title_len,text_len}` prefixes with `take()` on attacker lengths (`event.rs:36-49`), invalid UTF-8, the per-packet `.replace("\\n","\n")` heap alloc (amplification under flood), malformed `d:`/`card:` parsers, origin-detection-gated branches. |
+| **Why It Matters** | These codecs are entirely separate from the metric codec, so existing coverage gives no assurance; a codec panic on any listener thread crashes a fail-stop data component → crash-loop under flood. |
+| **Priority** | High |
+
+**Open Questions:**
+- Do the shared `helpers::*` parsers (`unix_timestamp`, `tags`, `cardinality`, `local_data`, `external_data`, `utf8`) contain any panic/unwrap/length-based pre-allocation? (Pivotal for the `Unreachable` guard.)
+- Does `take(title_len)`/`take(text_len)` or the message UTF-8 parser pre-allocate on the declared length before bounds-checking the buffer?
+- Can a malformed frame be mis-classified by `parse_message_type` (`mod.rs:1466`), incrementing the wrong decode-failure counter?
+
+---
+
+## Category F — Concurrency & Boundary Conditions
+
+Interleaving-sensitive paths the deterministic diff-test cannot reach, plus the non-finite-value
+boundary that cross-cuts intake and aggregation.
+
+### interner-reclamation-no-corruption — No corrupt/overlapping interner entries under races
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Under concurrent intern + drop-last-ref, the interner never returns overlapping or corrupt (`0x21`-filled) entries; worst case is a benign duplicate. |
+| **Invariant** | `Always(no reclaimed entry overlaps a live entry / no live &str reads the 0x21 sentinel)`, i.e. `Unreachable` on the corruption-detected branch; `Sometimes(reclaimed-slot reused)`, `Sometimes(drop re-check found resurrected entry)`. SUT-side. |
+| **Antithesis Angle** | Concurrency/interleaving under the real scheduler + real load with many shards/entries on a near-full interner — explores beyond the bounded loom model. |
+| **Why It Matters** | Most concurrency-unsafe component in the bounded-memory story; raw pointers exposed as `&'static str` into a buffer overwritten on reclaim; loom only bounds interleavings. |
+| **Priority** | High |
+
+**Open Questions:**
+- Confirm both the `try_intern` increment and the drop re-check take the same `InternerState` mutex (only race window is decrement→lock, which the re-check covers).
+- Cross-shard handles: can a shard-A `InternedString` ever be dropped against shard-B's lock?
+- _Resolved:_ the reclamation buffer-fill IS present in release (no cfg gate). **Two implementations use different sentinels:** `map.rs:392` fills `0x21`, `fixed_size.rs:458` fills `0xAA`. An assertion must use the correct sentinel per implementation, or check overlap directly (implementation-independent — preferred).
+
+### non-finite-values-handled-consistently — Non-finite values consistent, no ghost metric
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | Non-finite metric values never crash; an all-non-finite packet produces no downstream metric and consumes no interner/cache resources; no NaN reaches a DDSketch. |
+| **Invariant** | `Always(value.is_finite() at DDSketch insert boundary)` + `AlwaysOrUnreachable(no zero-point metric reaches aggregation)` + `Sometimes(non-finite dropped)`. Add `Sometimes(ghost-metric path)` ONLY if a codec-bypassing producer is confirmed. |
+| **Antithesis Angle** | Untrusted input: all-NaN/Inf and mixed packets across all metric types, checking the source `num_points==0` gate and the sketch boundary. |
+| **Why It Matters** | NaN-poisons-sketch and ghost-metric hazards. Honestly framed: the **ghost-metric** shape (zero-point metric) is DSD-`FloatIter`-specific and gated (`num_points==0 → Ok(None)` at `handle_frame` `mod.rs:1478`), so on the DSD path it is expected Unreachable. The **NaN-poison** facet, however, is LIVE via a non-DSD producer (see `ddsketch-no-nan-poison`). |
+| **Priority** | Medium |
+
+**Open Questions:**
+- Do the empty-iter `*_fallible` constructors return `Err` on empty input, or an empty value?
+- _Resolved:_ the NaN-reaches-sketch hazard is LIVE via `checks_ipc` Histogram metrics (encoder `insert_n`), while OTLP, replay (re-injects through DSD codec), and the DSD aggregate path are all finiteness-gated. The ghost-metric (zero-point) shape stays DSD-specific and gated → expected Unreachable on the DSD path.
+
+---
+
+## Category G — Transform & Enrichment Correctness
+
+Added after evaluation. The other categories treat ADP as a *transport* (don't crash, don't lose
+bytes); this one treats it as a *transformer* — mapping, filtering, and tag-filtering customer data,
+much of it driven by **runtime config that mutates while data flows**. This is the design
+partner's documented focus (the "Tag Filter RC Relay Stress Test"). These properties need the
+**config-stream add-on topology** (not standalone) and/or the **diff-test add-on** (Agent baseline).
+
+### mapper-output-matches-agent — DogStatsD mapper output matches the Datadog Agent
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | For the same input names and `dogstatsd_mapper_profiles`, ADP's mapper produces the same rewritten metric name and injected tags as the Datadog Agent mapper. |
+| **Invariant** | Differential `Always(mapped (name,tags) within ratio of Agent mapper)` per flush window on the `panoramic`/`stele` harness with a mapper-exercising corpus + identical profiles; `Sometimes(a metric was remapped)` for non-vacuity; SUT-side `Sometimes(cache hit == fresh miss)`. Mirrored expansion/wildcard logic (`dogstatsd_mapper/mod.rs:259-342`) is only self-tested today. |
+| **Antithesis Angle** | Overlapping/ambiguous profiles probing first-match ordering, names at the wildcard char-class boundary, and fault-induced flush-timing skew; runs on the diff-test add-on. |
+| **Why It Matters** | Mapper rename feeds every downstream filter (`run.rs:674-675`); a divergence is silent, customer-visible name/tag corruption. |
+| **Priority** | High |
+
+**Open Questions:**
+- Does the Agent apply all matching mappings per profile, or only the first (ADP returns on first match, `mod.rs:332`)? `(needs human input)`
+- Does the Agent restrict wildcard chars to the same `[a-zA-Z0-9\-_*.]` class?
+- The mapper has no `watch_for_updates` — it appears static-only, so its runtime-config facet is likely unreachable (unlike the filters).
+
+### mapper-interner-bounded — The mapper's second interner is bounded / fails visibly
+| | |
+|---|---|
+| **Type** | Safety (heap-off: silent non-remap hazard; heap-on default: bounded claim fails by design) |
+| **Property** | The mapper's own 64 KiB string interner stays bounded; when full, the metric is not silently forwarded under its original (unmapped) identity without accounting. |
+| **Invariant** | Heap-off: `AlwaysOrUnreachable(mapper interner full ⇒ metric forwarded under ORIGINAL context, accounted)` + `Sometimes(mapper resolve == None)`. Heap-on (default): `Sometimes(mapper intern heap fallback > 0)`. SUT-side needed (no counter). The `?` at `mod.rs:317-321`/`:277-282` makes resolve-`None` → keep original context with no drop/else. |
+| **Antithesis Angle** | Small `dogstatsd_mapper_string_interner_size` + a flood of distinct mappable names fills the mapper interner independently of the source interner; resolver-churn scheduling stress (30s idle expiry). |
+| **Why It Matters** | A SECOND bounded interner (distinct from `interner-full-bounded`) whose default heap-on voids its declared firm bound, and whose heap-off path silently emits the wrong (pre-map) identity — a load-dependent identity flip. |
+| **Priority** | High |
+
+**Open Questions:**
+- The mapper never calls `with_heap_allocations(false)` (defaults true) — intentional, or an oversight making its firm bound unenforceable? `(needs human input)`
+- Can the same name be remapped on one call but silently not on the next purely due to interner pressure?
+
+### filter-config-reload-correct — Live filter config reload applies correctly, never stale or silently cleared
+| | |
+|---|---|
+| **Type** | Safety (data-correctness under live config reload) |
+| **Property** | When the Core Agent pushes filter config over the RC stream while metrics flow, ADP applies the new filters to live data — never keeping stale filters nor silently clearing all filtering. |
+| **Invariant** | `Always(after a settled filter update, the next metric is filtered per the NEW config)` (SUT-side at apply); `Unreachable("filter update Lagged-dropped with no reconciliation")` on `watcher.rs:61` (no re-read exists); `AlwaysOrUnreachable(tag filtering not silently fully-cleared by an unintended event)`; `Sometimes(reload while metrics in flight)`. Three confirmed hazards: (1) `broadcast::Lagged` warn+continue on a cap-100 broadcast → permanent staleness; (2) partial-deserialize skip; (3) `diff_recursive` is additive so key **deletion fires no event** (stale), while explicit-empty clears `tag_filterlist` but is ignored by the prefix/post-agg filters. |
+| **Antithesis Angle** | Burst config updates faster than the filter task drains its receiver (node-throttle `adp` to widen the lag window) interleaved with sustained metric load; explore deletion vs explicit-empty vs malformed-entry. **The design partner's explicit stress-test focus.** |
+| **Why It Matters** | Stale/half-applied/cleared filtering on live customer data is silent and customer-visible — the single most product-relevant transform property. |
+| **Priority** | High |
+
+**Open Questions:**
+- Is the `Lagged` drop accepted as best-effort (a dropped final update is permanent staleness, no re-read)? `(needs human input)`
+- Should removing a filter key from RC clear the filter? ADP's additive diff keeps it stale — confirm vs Agent RC semantics. `(needs human input)`
+- The `tag_filterlist` (clears on `None`) vs prefix/post-agg (ignores `None`) asymmetry — intended?
+- Requires the config-stream add-on; does **not** fire in standalone mode.
+
+### tag-filterlist-applied-consistently — tag_filterlist applies cached and type-gated filtering consistently
+| | |
+|---|---|
+| **Type** | Safety |
+| **Property** | `tag_filterlist`'s context-cache results always agree with a fresh computation, only Counter+sketch metrics are filtered, and output matches the Agent's time-sampler tag filtering. |
+| **Invariant** | `Always(cache-hit filtered tags == fresh filter_metric_tags)` (SUT-side, catches stale/colliding cache entries, `mod.rs:240-263`); `AlwaysOrUnreachable(only Counter+sketch metrics have tags removed)` (`mod.rs:235-237`); optional differential `Always(post-filter (name,tags) within ratio of Agent)`; `Sometimes(tag removed)` + `Sometimes(cache hit served filtered result)`. |
+| **Antithesis Angle** | Mixed-type load with overlapping names stressing the 100k/30s-TTI context cache, interleaved with reloads (compose with `filter-config-reload-correct`) + node-throttling to widen the reload-vs-apply window so a cache entry outlives the reload that should invalidate it. |
+| **Why It Matters** | Counter+sketch-only gate and a hot-path cache, both claiming Agent equivalence with only self-tests; a stale entry or wrong type-gate silently leaks/drops tags (cardinality/PII). |
+| **Priority** | Medium (High if the type gate diverges from the Agent) |
+
+**Open Questions:**
+- Does the Agent filter the same Counter+sketch subset? `(needs human input)`
+- Cache keyed by full `Context` incl. origin tags — can origin-tag-only differences collide?
+- If a reload is Lagged-dropped, the cache is not rebuilt and stale filtered contexts persist to TTI — confirm this compound failure.
+
+### prefix-filter-ordering-matches-agent — Prefix/blocklist and post-aggregate filtering match the Agent's stage split
+| | |
+|---|---|
+| **Type** | Safety (ordering + differential) |
+| **Property** | ADP's listener-side prefix/blocklist filter and post-aggregate histogram-series filter run in the correct pipeline order and split responsibility over the shared keys exactly as the Datadog Agent does. |
+| **Invariant** | Differential `Always(end-to-end keep/drop + final name within ratio of Agent)`; `AlwaysOrUnreachable(non-histogram-shaped entry not applied post-aggregate)` and converse; `AlwaysOrUnreachable(post_agg never drops a sketch)`; optional blueprint-shape `Always(dsd_prefix_filter between dsd_enrich and dsd_tag_filterlist; dsd_post_agg_filter after dsd_agg)` (`run.rs:674-679`); `Sometimes(prefix added/listener drop/post-agg drop)`. |
+| **Antithesis Angle** | Corpus where a name's keep/drop depends on stage order + fault-induced flush-timing skew on the diff run; composes with `mapper-output-matches-agent` and `filter-config-reload-correct`. |
+| **Why It Matters** | A past fix "moved DSD prefix/filter in front of enrich" (bug-history-sensitive); the listener-vs-time-sampler split over four shared keys is subtle, fragile, and has no end-to-end regression guard. |
+| **Priority** | Medium (High as the ordering-regression tripwire) |
+
+**Open Questions:**
+- Does the Agent split on exactly the `<metric>.<aggregate-suffix>` shape `contains_filter_entry` uses? `(needs human input)`
+- Is the prefix-filter-after-mapper ordering load-bearing for equivalence, with any guard besides this property?
+- A reload updating one filter but lagging the other could filter at one stage but not the other for the same rule — confirm reachability.
+
+## Catalog-wide notes
+
+- **Default config is hostile to the bounded-memory family:** memory limiter disabled
+  (`MemoryMode::Disabled`), interner heap-spill enabled (`dogstatsd_allow_context_heap_allocs=true`),
+  disk retry persistence off. Workloads must opt into protective settings to test the "holds"
+  branches and leave defaults to capture the "fails by design" branches. This is the single most
+  important workload-configuration decision (see `deployment-topology.md`).
+- **One guaranteed-crash finding** needs no fault injection, only clock exploration:
+  `aggregate-clock-skew-stable` (forward jump → flood) — cheap, high-value first target. Its former
+  sibling `aggregate-no-panic-any-window` (sub-second window → divide-by-zero) was **fixed upstream**
+  (window is now `NonZeroU64`, PR #1772) and is retained only as a regression tripwire.
+- **Confirmed-live latent bug:** `ddsketch-no-nan-poison` is reachable today via a `checks_ipc`
+  Histogram metric carrying NaN, which bypasses the per-source finiteness filters and poisons the
+  encoder sketch's sum/avg permanently. The fix belongs at the sketch boundary.
+- **Resource exhaustion via untrusted input:** `replay-no-panic-on-malformed-capture` carries two
+  confirmed vectors — unguarded whole-file `fs::read` and uncapped `zstd::decode_all` — both reachable
+  in the separate replay CLI process.
+- **Differential property:** `aggregate-matches-agent` is anchored on the existing
+  `panoramic`/`stele` diff harness, not an in-process SDK assertion.
+- **SUT-side instrumentation is required** (not optional) for: all crash/panic properties, NaN-at-
+  sketch-boundary, bin-count, interner-corruption, source-misroute, and limiter-RSS-failure — these
+  internal states are invisible to a workload-only checker. Existing telemetry counters
+  (`events_discarded_total`, `framing_errors`, `*_decode_failed`, queue-drop counters) serve only as
+  `Sometimes`-reachability anchors, not as the safety assertions themselves.
+- **(R1) The container's s6 supervisor auto-restarts ADP on exit**, so "process is up" workload
+  assertions are vacuously green even during a crash-loop. Every no-crash property
+  (`malformed-dsd-no-crash`, `malformed-event-sc-no-crash`, `replay-no-panic-on-malformed-capture`,
+  the aggregate-crash pair) must assert SUT-side `Unreachable` at panic sites — or assert on
+  restart-count — **never** container liveness.
+- **(R2, updated 2026-05-30) The Antithesis Rust SDK is now wired into ADP** behind the `antithesis`
+  cargo feature (`antithesis_init()` + a bootstrap `assert_reachable!`), and the harness binaries
+  carry workload-side anchors — so the "fork ADP + add the SDK + build an instrumented image"
+  prerequisite is largely satisfied (the wiring is proven). ~17 properties still need their net-new
+  in-process SUT-side **invariant** assertions landed on top of that scaffold; the ~10 workload-only
+  properties (forwarder delivery, retry-queue bounds, shutdown, config-gate, RSS) can run first.
+- **(R3) No-loss properties must use TCP or UDS ingress, not UDP** — UDP's inherent packet loss
+  confounds any "accepted == delivered" reconciliation (`no-silent-interconnect-drop`,
+  `forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`, `shutdown-drains-no-loss`,
+  `events-sc-no-silent-loss`). Reserve UDP for the no-crash properties.
+- **(R4) Anti-vacuity:** safety properties gated by hard-to-reach `Sometimes` anchors (bin collapse,
+  interner resurrection race, events/SC reachability) require the workload to force the anchor
+  config/corpus, and the run synthesizer must report an **unreached `Sometimes` as inconclusive, not
+  passing**.
+- **(G2 topology dependency)** the runtime filter config-reload properties
+  (`filter-config-reload-correct`, and the reload facets of the tag-filterlist/prefix-filter
+  properties and `config-runtime-update-not-revalidated`) require the **config-stream add-on
+  topology** (Core Agent or stub) — they pass vacuously in standalone mode because the config watcher
+  never fires.
+
+## Scope (confirmed with user, 2026-05-28)
+
+- **In scope:** the DogStatsD pipeline end-to-end — metrics, events, service-checks — plus the
+  `saluki-core` runtime invariants (memory bounds, backpressure, lifecycle, pooling, interning) and
+  the runtime filter config-reload surface.
+- **Deferred (documented exclusion, not an oversight):** the **traces/APM, logs, and OTLP** pipelines
+  (`run.rs:506-591,700-758`). These are wired in ADP but are not the first-customer (the design partner / Agent
+  7.80.0) surface. They carry their own untrusted-input risk (notably the SQL-parsing
+  `trace_obfuscation/sql.rs` and a second OTLP protobuf source + forwarder) and are the natural next
+  expansion if/when they enter scope.
+- **Fault availability (confirmed enabled for the tenant):** **node termination**, **clock jitter**,
+  and **custom `/proc` faults** are all enabled — so the crash-recovery
+  (`disk-persisted-retry-survives-restart`, `data-component-failure-triggers-process-shutdown`),
+  clock-skew (`aggregate-clock-skew-stable`), and limiter-RSS-failure
+  (`memory-limiter-survives-rss-read-failure`) properties are realizable rather than vacuous.
+
+## Open Questions (catalog-wide / analysis-level)
+
+- "METRIC_CONTROL relay" naming from Confluence has no source identifier — config flows through the
+  generic snapshot/partial stream. Confirm with the team.
+- _Resolved:_ `interconnect_capacity` default = **128** event buffers (`topology/mod.rs:37`).
+- _Resolved:_ no protective memory setting is on by default (`memory_mode` = `Disabled`,
+  `allow_context_heap_allocs` = true) → "bounded memory" is opt-in/aspirational under default config.
+  This pins Category A's "fails by design under defaults" framing.
diff --git a/test/antithesis/scratchbook/property-relationships.md b/test/antithesis/scratchbook/property-relationships.md
new file mode 100644
index 00000000000..866de006179
--- /dev/null
+++ b/test/antithesis/scratchbook/property-relationships.md
@@ -0,0 +1,187 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space — headline guarantees and gap analyses that seed properties.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD
+    why: Root guarantee for the memory + data-loss clusters.
+  - path: https://datadoghq.atlassian.net/browse/DADP
+    why: ADP Jira project for tracked gaps/regressions.
+  - path: https://github.com/DataDog/saluki/pull/1768
+    why: PR review #4393897611 (Copilot) flagged the stale property count reconciled here.
+---
+
+# Property Relationships
+
+Lightweight clustering of the 35 catalog properties by shared code paths, failure mechanisms, and
+suspected dominance. Slugs match `property-catalog.md`.
+
+## Cluster 1 — Bounded memory (the determinism story)
+
+Properties: `rss-bounded-under-cardinality`, `aggregate-context-limit-enforced`,
+`interner-full-bounded`, `memory-limiter-survives-rss-read-failure`,
+`retry-queue-bounded-under-outage`, and (memory facet) `aggregate-clock-skew-stable`.
+
+- **Shared mechanism:** every one bears on whether actual RSS stays within the grant. They share the
+  `resource-accounting` limiter, the aggregate `HashMap`/`context_limit`, and the `stringtheory`
+  interner.
+- **Dominance:** `rss-bounded-under-cardinality` is the **roll-up** — it observes the aggregate
+  outcome (RSS ≤ grant). The other four explain *why* it does or doesn't hold:
+  `aggregate-context-limit-enforced` and `interner-full-bounded` are the two designed bounds;
+  `interner-full-bounded` (heap-on default) and `memory-limiter-survives-rss-read-failure` are the
+  two leaks that make the roll-up fail. If `rss-bounded` passes, the sub-properties likely hold; if
+  it fails, the sub-properties localize the cause. Test the roll-up *and* the components.
+- **Config coupling:** all are sensitive to the same three default-off protective settings
+  (`memory_mode`, `allow_context_heap_allocs`, disk persistence). A single workload-config matrix drives
+  the whole cluster.
+
+## Cluster 2 — Egress data loss & durability
+
+Properties: `forwarder-eventual-delivery`, `retry-queue-bounded-under-outage`,
+`disk-persisted-retry-survives-restart`, plus the egress facet of `shutdown-drains-no-loss`.
+
+- **Shared code:** all live in `common/datadog/io.rs` + `retry/queue/persisted.rs` (`PendingTransactions`,
+  circuit breaker, disk queue).
+- **Tension (not dominance):** `forwarder-eventual-delivery` (no loss) and
+  `retry-queue-bounded-under-outage` (bounded memory ⇒ eventual drop) are in direct tension — the
+  queue cap is the escape valve that *causes* the loss the delivery property forbids. They must be
+  tested with coordinated outage durations: short outage → delivery holds; prolonged outage →
+  bounded-drop holds. `retry-queue-bounded` is the safety backstop; `forwarder-eventual-delivery` is
+  the liveness goal within the backstop's budget.
+- **`disk-persisted-retry-survives-restart`** extends `forwarder-eventual-delivery` across a crash;
+  it shares the kill+restart fault with `aggregate-matches-agent` (restart facet).
+
+## Cluster 3 — No silent internal loss / routing
+
+Properties: `no-silent-interconnect-drop`, `source-dispatch-no-misroute`,
+`data-component-failure-triggers-process-shutdown`.
+
+- **Shared mechanism:** the `topology/interconnect` dispatcher and the DSD source's multi-output
+  dispatch. All three are about events going to the *right place or nowhere silently*.
+- **Connection:** `no-silent-interconnect-drop` (backpressure, no discard on wired edges) and
+  `source-dispatch-no-misroute` (no cross-output leakage) are complementary halves of "events are
+  routed correctly under load/failure." `data-component-failure-triggers-process-shutdown` is the
+  backstop: when routing/processing *does* fail, the whole process must stop rather than run half-broken.
+
+## Cluster 4 — Aggregation correctness
+
+Properties: `aggregate-matches-agent`, `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable`,
+`ddsketch-bin-count-bounded`, `ddsketch-relative-error-bound`, `ddsketch-no-nan-poison`.
+
+- **Shared code:** `transforms/aggregate/mod.rs` + `lib/ddsketch`.
+- **Dominance:** `aggregate-matches-agent` is the **differential roll-up** — any sketch-accuracy,
+  bin-count, NaN, clock-skew, or bucketing deviation that changes output will show up as a diff
+  against the Agent (if the Agent doesn't share the same deviation). The sub-properties catch
+  deviations that are silent in a single happy-path comparison (merge-order accuracy, NaN poison,
+  internal bin explosion).
+- **Crash subset:** the forward-jump facet of `aggregate-clock-skew-stable` is the live crash/DoS
+  property that feeds `data-component-failure-triggers-process-shutdown` (a panicking aggregate is
+  exactly the "component finishes unexpectedly" trigger). `aggregate-no-panic-any-window` shares the
+  cluster but its original `% 0` panic vector was **closed upstream** (window is now `NonZeroU64`,
+  PR #1772) — it remains as a low-cost `Unreachable` regression tripwire, not an active crash bug.
+- **NaN crosscut:** `ddsketch-no-nan-poison` shares its boundary with
+  `non-finite-values-handled-consistently` (Cluster 6) — same NaN, two angles (sketch internals vs.
+  source gating).
+
+## Cluster 5 — Lifecycle & config gating
+
+Properties: `topology-ready-before-intake`, `config-stall-no-deadlock`,
+`config-incompatible-refuses-start`, `config-runtime-update-not-revalidated`,
+`graceful-shutdown-within-30s`, `data-component-failure-triggers-process-shutdown`.
+
+- **Shared code:** `bin/agent-data-plane/src/cli/run.rs`, `internal/remote_agent.rs`,
+  `saluki-core/health`, `topology/running.rs`, `runtime/supervisor.rs`.
+- **Lifecycle ordering:** `topology-ready-before-intake` (startup) and `graceful-shutdown-within-30s`
+  (shutdown) bracket the run; `config-incompatible-refuses-start` and `config-stall-no-deadlock`
+  gate whether startup proceeds at all.
+- **Config-gate pair:** `config-incompatible-refuses-start` (startup gate) and
+  `config-runtime-update-not-revalidated` (the runtime hole in that same gate) are two halves of one
+  guarantee — incompatible config is refused. The first holds; the second is the gap. They share
+  `check_and_warn_config` / `ConfigClassifier` and the `run.rs` callsite.
+- **Shutdown pair:** `graceful-shutdown-within-30s` (timing/clean completion) and
+  `shutdown-drains-no-loss` (Cluster 2; what data survives) are deliberately split views of the same
+  shutdown event — test together, assert on different things. Neither dominates.
+
+## Cluster 6 — Untrusted input parsing
+
+Properties: `malformed-dsd-no-crash`, `replay-no-panic-on-malformed-capture`,
+`replay-corruption-not-silent-eof`, `non-finite-values-handled-consistently`.
+
+- **Shared mechanism:** parsing attacker-influenced bytes (`saluki-io` DSD codec; replay reader).
+- **No-crash subset:** `malformed-dsd-no-crash` and `replay-no-panic-on-malformed-capture` are the
+  same property class (untrusted input never crashes) on two different parsers; both feed the
+  fail-stop backstop in Cluster 5.
+- **Fidelity vs. crash:** `replay-corruption-not-silent-eof` is about *silent wrong data* rather than a
+  crash: a distinct failure mode on the same reader as `replay-no-panic-on-malformed-capture`.
+- **`non-finite-values-handled-consistently`** bridges to Cluster 4 via the NaN/sketch boundary.
+
+## Cluster 7 — Concurrency interleavings
+
+Properties: `interner-reclamation-no-corruption`, plus the timing facets of
+`no-silent-interconnect-drop` (multi-sender partial delivery), `forwarder-eventual-delivery`
+(shared circuit-breaker state), and `aggregate-context-limit-enforced` (breach flag vs. flush race).
+
+- **Shared theme:** these are the properties where Antithesis's interleaving search is the *primary*
+  value (vs. fault injection). `interner-reclamation-no-corruption` is the purest — a loom-tested
+  unsafe path where Antithesis explores beyond the bounded model. The others are concurrency facets
+  of properties whose main home is another cluster.
+
+## Cluster 8 — Transform & enrichment correctness (added after evaluation)
+
+Properties: `mapper-output-matches-agent`, `mapper-interner-bounded`, `filter-config-reload-correct`,
+`tag-filterlist-applied-consistently`, `prefix-filter-ordering-matches-agent`.
+
+- **Shared code:** the DSD transform chain — `dogstatsd_mapper` → `dogstatsd_prefix_filter` →
+  `dsd_tag_filterlist` → `dsd_agg` → `dsd_post_agg_filter` (`run.rs:674-679`) — plus the runtime
+  config watcher (`saluki-config/src/dynamic/watcher.rs`).
+- **Differential subset:** `mapper-output-matches-agent`, `prefix-filter-ordering-matches-agent`, and
+  the optional facet of `tag-filterlist-applied-consistently` all ride the **diff-test add-on** and
+  are facets of `aggregate-matches-agent` (Cluster 4) at earlier pipeline stages — that roll-up
+  catches them only if its corpus exercises mapped/filtered metrics (an open question).
+- **Config-reload subset:** `filter-config-reload-correct` is the hub — `tag-filterlist-applied-
+  consistently` (stale cache after a Lagged-dropped reload) and `config-runtime-update-not-revalidated`
+  (Cluster 5) compose with it. All require the **config-stream add-on topology**; all pass vacuously
+  in standalone mode.
+- **Interner crosscut:** `mapper-interner-bounded` is a second, independent instance of
+  `interner-full-bounded` (Cluster 1) — a distinct 64 KiB interner with its own silent-drop path.
+- **Bug-history crosscut:** `prefix-filter-ordering-matches-agent` targets the "moved prefix/filter in
+  front of enrich" fix; it is the ordering-regression tripwire for the whole chain.
+
+## Cluster 9 — Events & service-checks paths (added after evaluation)
+
+Properties: `events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, `events-sc-pipeline-reachable`.
+
+- **Shared code:** the always-on `dsd_in.{events,service_checks}` sub-pipelines and their separate
+  codecs/encoders. These are the events/service-checks analogues of Cluster 3's `no-silent-interconnect-drop`,
+  Cluster 2's `forwarder-eventual-delivery`, and Cluster 6's `malformed-dsd-no-crash` — same
+  mechanisms, different always-on edges.
+- **Anti-vacuity dependency:** `events-sc-pipeline-reachable` is the R4 anchor that keeps the other
+  two from passing trivially under a metrics-dominated workload — a hard dependency, not just a relation.
+
+## Shared-scenario pairs (R10 — count is not 35 independent test efforts)
+
+These pairs share a fault scenario / assertion site and should be implemented together; treat them as
+one test effort each for portfolio-sizing:
+
+- `shutdown-drains-no-loss` ⇄ `graceful-shutdown-within-30s` — same shutdown event, different assertions
+  (what data survives vs. clean completion in time).
+- `non-finite-values-handled-consistently` ⇄ `ddsketch-no-nan-poison` — same NaN, two angles (source
+  gating vs. sketch-boundary poison).
+- `rss-bounded-under-cardinality` ⇄ its four Cluster-1 sub-properties — roll-up vs. localizers.
+- `aggregate-matches-agent` ⇄ its Cluster-4 sub-properties and the Cluster-8 differential subset —
+  roll-up vs. localizers/earlier-stage facets.
+- `config-incompatible-refuses-start` ⇄ `config-runtime-update-not-revalidated` — startup gate vs. its
+  runtime hole.
+
+## Cross-cutting observations
+
+- **One config matrix drives many clusters.** The memory_mode / heap-allocs / disk-persistence
+  settings gate Clusters 1 and 2; the protective-on vs. default-off split is the master test variable.
+- **Two roll-up properties** (`rss-bounded-under-cardinality`, `aggregate-matches-agent`) each
+  dominate a cluster — they're cheap to assert and catch broad regressions, but the sub-properties
+  localize causes and catch silent-but-output-neutral deviations. Keep both levels.
+- **Fail-stop is the shared backstop.** `data-component-failure-triggers-process-shutdown` is
+  downstream of every crash property (Clusters 4, 6) — when any invariant trips a panic, this is what
+  must happen next. It belongs to Cluster 5 but connects to all crash properties.
diff --git a/test/antithesis/scratchbook/sut-analysis.md b/test/antithesis/scratchbook/sut-analysis.md
new file mode 100644
index 00000000000..d3d6e761983
--- /dev/null
+++ b/test/antithesis/scratchbook/sut-analysis.md
@@ -0,0 +1,307 @@
+---
+sut_path: /home/ssm-user/src/saluki
+commit: fc4bb29728814ddf9321572b954ec28f58faeb53
+updated: 2026-05-30
+external_references:
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/
+    why: ADP Confluence space — headline guarantees, Phase 1 bug bash, gap analyses, weekly summaries.
+  - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD
+    why: Source of the "ADP will not crash under load, losing customer data" guarantee.
+  - path: https://datadoghq.atlassian.net/browse/DADP
+    why: ADP Jira project for tracked gaps/incidents (e.g. DADP-108 macOS gaps, DADP-45 telemetry compat).
+---
+
+# SUT Analysis: Agent Data Plane (ADP)
+
+> Synthesis of a 5-agent discovery ensemble (12 attention focuses) over `bin/agent-data-plane`
+> and the `saluki-*` libraries, grounded against the Datadog ADP Confluence/Jira spaces.
+> Each major finding notes the focus(es) that surfaced it.
+
+## 1. What ADP is (product context)
+
+**Agent Data Plane (ADP)** is a Rust reimplementation of the Datadog Agent's telemetry data
+paths — primarily the **DogStatsD pipeline** — built on the **Saluki** toolkit. It runs
+*alongside* the Datadog Core Agent (not standalone in production; `standalone` mode is a
+vestige, `AGENTS.md:31`). It is being delivered to first customers for the design partner,
+targeting Agent **7.80.0**; `data_plane.enabled: true` routes DogStatsD through ADP.
+
+Stated priorities (`AGENTS.md:13`): **correctness first, then performance.** Marketed benefit:
+**deterministic resource usage** (`docs/agent-data-plane/index.md:1-6`).
+
+**Headline guarantee (Confluence, "What Comes After DogStatsD"): "ADP will not crash under
+load, losing customer data."** This single sentence decomposes into the two property families
+that dominate this analysis: *no crash / bounded memory* and *no silent data loss*.
+
+User-visible failure modes: lost metrics (silent drops), wrong aggregates (counter→rate, sketch
+error, bucket misalignment), process crash (panics on hot paths), memory blowup (OOM).
+
+## 2. Architecture & data flow (Focus 1, 9)
+
+### Component topology model
+
+ADP is a **streaming topology** of typed components wired in a blueprint, built on `saluki-core`:
+
+- **Component kinds:** `source` → `transform`/`relay`/`decoder` → `encoder` → `forwarder`/`destination`.
+- **Two data-model channel types** on edges (`topology/interconnect`, `built.rs:507-647`):
+  `EventsBuffer` (source/decoder/transform → transform/encoder/destination) and
+  `PayloadsBuffer` (encoder/relay → forwarder/decoder).
+- **Build → spawn lifecycle** (`run.rs:238-239`): `TopologyBlueprint` is assembled, then
+  `build()` then `spawn()`. The topology starts accepting data only after `health_registry.all_ready()`.
+
+### The DogStatsD pipeline (production-critical path, `run.rs:593-698`)
+
+Source `dsd_in` has three named outputs — `metrics`, `events`, `service_checks`:
+
+```
+dsd_in.metrics  → dsd_enrich (chained: dogstatsd_mapper) → dsd_prefix_filter
+                → dsd_tag_filterlist → dsd_agg (aggregate) → dsd_post_agg_filter
+                → metrics_enrich (host_enrichment + host_tags) → dd_metrics_encode → dd_out
+dsd_in.metrics  → dsd_stats_out  (statistics tap)   [+ dsd_debug_log_out if enabled]
+dsd_in.events   → events_enrich → dd_events_encode → dd_out
+dsd_in.service_checks → service_checks_enrich → dd_service_checks_encode → dd_out
+```
+
+`dd_out` is the **Datadog forwarder** to the Datadog intake platform. Tag filtering happens
+both before aggregation (`dsd_tag_filterlist`) and after (`dsd_post_agg_filter`). OTLP metrics
+deliberately **skip aggregation** to avoid turning counters into rates (`run.rs:751-753`).
+
+### Listeners & intake (Focus 1, 9)
+
+DogStatsD source (`sources/dogstatsd/mod.rs`) listens on UDP (8125), TCP, **UDS datagram**, and
+**UDS stream**. `SO_REUSEPORT` UDP autoscaling on Linux (`mod.rs:667-686`). DNS resolved via
+`tokio::net::lookup_host` (skipped when `non_local_traffic=true`). For connectionless sockets,
+framing/I/O errors **clear the buffer and continue** — a malformed packet never kills the socket
+(`mod.rs:1283-1318`); per-frame parse errors are logged and skipped. Origin detection uses UDS
+peer credentials; credential errors are counted but **do not drop the packet**.
+
+### Egress: the Datadog forwarder (Focus 8, 9 — high subtlety)
+
+`TransactionForwarder` (`common/datadog/io.rs`): one main I/O loop fans transactions to **one task
+per resolved endpoint** (own bounded `mpsc(8)` channel + own retry queue). Per-endpoint Tower
+stack: set URI/API-key → version headers → `concurrency_limit` → **`RetryCircuitBreakerLayer`** →
+HTTP client.
+
+- **Retry model is a circuit breaker, not inline retry.** When the policy says "retry," the
+  breaker returns `Error::Open`, arms a shared **exponential-with-jitter** backoff, and the request
+  is re-enqueued to a **low-priority** queue (`io.rs:468-474`). All calls are rejected until backoff
+  elapses.
+- **Retry classification** (`classifier/http.rs`): transport errors + 5xx + most 4xx retry;
+  **400/401/403/413 are NOT retried** — treated as permanent failure and **dropped** (data loss).
+- **Two-tier `PendingTransactions`:** high-priority in-memory `VecDeque` for fresh data, low-priority
+  `RetryQueue` for retries + overflow; **oldest dropped on overflow** (bias to freshest data),
+  counted as `track_queue_drops` telemetry — silent loss.
+- **Disk persistence** (`retry/queue/persisted.rs`): if `forwarder_retry_queue_storage_max_size > 0`,
+  the retry queue persists to disk; **init failure silently falls back to in-memory only** (durability
+  downgrade). On shutdown, pending txns flush to disk; **without disk persistence they are dropped**.
+  Retry-queue IDs are built to survive API-key rotation.
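+
+The retry classification above can be sketched as a small pure function. This is a minimal
+illustration, not the code in `classifier/http.rs`: the `Outcome` type and `is_retryable` name are
+hypothetical, and reading "most 4xx" as "every 4xx except the four listed codes" is an assumption.
+
+```rust
+/// Outcome of one request attempt (illustrative type, not the real API).
+#[derive(Debug, Clone, Copy)]
+enum Outcome {
+    /// The request failed before any HTTP status arrived.
+    Transport,
+    /// The intake answered with this HTTP status code.
+    Status(u16),
+}
+
+/// Returns true when the attempt should be re-enqueued for retry.
+fn is_retryable(outcome: Outcome) -> bool {
+    match outcome {
+        Outcome::Transport => true,
+        // Permanent failures: the payload is dropped, not retried.
+        Outcome::Status(400 | 401 | 403 | 413) => false,
+        // Assumption: every other 4xx, and all 5xx, is retried.
+        Outcome::Status(400..=599) => true,
+        // Success and other non-error statuses need no retry.
+        Outcome::Status(_) => false,
+    }
+}
+
+fn main() {
+    let samples = [
+        Outcome::Transport,
+        Outcome::Status(503),
+        Outcome::Status(429),
+        Outcome::Status(413),
+    ];
+    for outcome in samples {
+        println!("{:?} -> retry: {}", outcome, is_retryable(outcome));
+    }
+}
+```
+
+A property check can enumerate all status codes against a table like this one and assert that only
+the four permanent codes ever lead to a drop without a retry.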
+
+### Control plane / config (Focus 1, 9)
+
+ADP registers as a Core Agent "remote agent" (`internal/remote_agent.rs`) and receives an
+**authoritative config over a gRPC config stream** (snapshot + partial updates). Startup **blocks**
+on `dynamic_config.ready().await` for the first config (`run.rs:119-121`, *no timeout shown*). Stream
+end → reconnect after fixed **5s**. ADP exposes Status/Flare/Telemetry gRPC services back to the
+Agent. **Note:** the grounding mentions a "METRIC_CONTROL relay," but no `METRIC_CONTROL` identifier
+exists in the tree — config control flows through the generic snapshot/partial mechanism (open
+question, see §9).
+
+### Supervision (Focus 1, 3)
+
+Erlang/OTP-style `Supervisor` (`runtime/supervisor.rs`) with `OneForOne`/`OneForAll` restart
+strategies bounded by intensity/period. **Crucial split:** the **internal supervisor** (control
+plane, internal telemetry, env/workload) *is* supervised/restartable, but **the primary data
+topology is NOT** — `RunningTopology` spawns each data component into a `JoinSet` with **no restart**.
+Any data component finishing → `wait_for_unexpected_finish` → **whole-process shutdown** (`run.rs:280-283`,
+`running.rs:40-51`). Data-plane components are **fail-stop**; recovery is full-process restart (the s6
+supervisor in the container restarts ADP on exit). Init failures never restart; only runtime failures do.
+
+## 3. State management & persistence (Focus 2)
+
+- **Aggregation state** (`transforms/aggregate/mod.rs`): a single `HashMap<Context, AggregatedMetric>`
+  owned exclusively by the transform's task — **no locks, single-task ownership**, all mutation `&mut self`.
+  Hard `context_limit` (default **1,000,000**) enforced at insert: a *new* context over the cap is
+  **dropped** (existing contexts always merge). This is the central memory-determinism lever.
+- **Zero-value counter keep-alive:** flushed counters emit zeros until `counter_expiry_seconds`
+  (default 300s); kept-alive contexts still count against the limit.
+- **Shutdown drop-by-design:** open (current-window) buckets are flushed on shutdown **only if
+  `flush_open_windows`** (default **false**) — by default in-flight open-bucket data is dropped to
+  avoid double counting on restart. `PassthroughBatcher` (pre-timestamped metrics) buffers up to
+  `passthrough_idle_flush_timeout` (default 1s); drops events if its buffer stays full.
+- **Restart loses all state:** supervisor restart re-runs `initialize()` from the spec template; a
+  restarted component starts empty. (Mostly moot for data components, which aren't restarted.)
+- **Two disk-backed subsystems only:** the forwarder retry queue (above) and **DogStatsD capture/replay**
+  (`sources/dogstatsd/replay/`) — capture files written by a dedicated OS thread (protobuf
+  `UnixDogstatsdMsg` records + `TaggerState` trailer), bounded `sync_channel`, lock-free `ArcSwapOption`
+  for replay tagger state. Everything else is in-memory.
+- **Object pools** (`pooling/`): `FixedSizeObjectPool` (Mutex+Semaphore, async-blocks when empty),
+  `ElasticObjectPool` (min/max + background EWMA shrinker task — *if the shrinker future isn't driven,
+  the pool never shrinks*), `OnDemandObjectPool` (allocates every time).
+- **String interner** (`stringtheory/interning/fixed_size.rs`): fixed byte buffer, sharded
+  `[Arc<Mutex<…>>; SHARD_FACTOR]`, manual reclamation/tombstoning; `try_intern` returns `None` when full.
+  Interner determinism backs the context memory bound — *but* the resolver's default
+  `allow_heap_allocations=true` lets full-interner strings **spill to the heap (effectively unbounded)**.
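+
+The `context_limit` rule in the first bullet (new contexts over the cap are dropped, existing
+contexts always merge) can be sketched as follows. This is a simplified stand-in: the `Aggregator`
+type, its `String` keys, and the `f64` sum are assumptions for illustration, not the real
+`HashMap<Context, AggregatedMetric>`.
+
+```rust
+use std::collections::HashMap;
+
+/// Illustrative aggregator with a hard cap on distinct contexts.
+struct Aggregator {
+    contexts: HashMap<String, f64>,
+    context_limit: usize,
+    dropped: u64,
+}
+
+impl Aggregator {
+    fn new(context_limit: usize) -> Self {
+        Aggregator { contexts: HashMap::new(), context_limit, dropped: 0 }
+    }
+
+    /// Returns true if the sample was merged or inserted, false if it was dropped.
+    fn insert(&mut self, context: &str, value: f64) -> bool {
+        // Existing contexts always merge, even when the map is at the cap.
+        if let Some(sum) = self.contexts.get_mut(context) {
+            *sum += value;
+            return true;
+        }
+        // A new context over the cap is dropped and counted.
+        if self.contexts.len() >= self.context_limit {
+            self.dropped += 1;
+            return false;
+        }
+        self.contexts.insert(context.to_string(), value);
+        true
+    }
+}
+
+fn main() {
+    let mut agg = Aggregator::new(2);
+    for (ctx, v) in [("a", 1.0), ("b", 1.0), ("c", 1.0), ("a", 2.0)] {
+        let kept = agg.insert(ctx, v);
+        println!("{}: kept={}", ctx, kept);
+    }
+    println!("dropped={}", agg.dropped);
+}
+```
+
+The invariant worth asserting is that the map length never exceeds the cap while samples for
+already-known contexts are never rejected.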
+
+## 4. Concurrency model (Focus 3)
+
+- **Backpressure is real and is the load-safety mechanism:** all inter-component edges are bounded
+  `tokio::mpsc`; `Dispatcher::send` **awaits** on a full channel (`interconnect/dispatcher.rs:86-122`),
+  so a slow downstream blocks upstream all the way back to the socket read loop. The DSD source calls
+  `memory_limiter.wait_for_capacity().await` once per read (`mod.rs:1186`).
+- **Fan-out hazards:** an output with N senders clones to N-1 and moves into the last, awaiting each
+  **sequentially** — one slow consumer stalls delivery to the others. A disconnected output (zero
+  senders) **silently discards** events (`events_discarded_total`). Send is **not atomic** across
+  senders — partial delivery possible if a later sender errors.
+- **Memory limiter** (`resource-accounting/limiter.rs`): a dedicated **std::thread** polls RSS every
+  **250ms** and stores a backoff in an `AtomicU64`. Backpressure is **advisory/cooperative** (only
+  tasks that call `wait_for_capacity` are throttled) and **capped at 25ms** of sleep, starting at 95%
+  of limit. The checker `.expect()`s on the RSS read — **panics the limiter thread if `/proc` reads
+  fail mid-run**, silently disabling all memory backpressure.
+- **Interner reclamation** is loom-tested; the documented hazard (intern vs drop-last-ref race) is
+  resolved by a mutex + refcount re-check; worst case is a duplicate entry, never corruption. It is
+  still the most concurrency-sensitive component in the bounded-memory story (raw pointers used as
+  `&'static str` keys into a buffer that gets overwritten on reclaim).
+- **Health registry** (`health/mod.rs`): single `Arc<Mutex<…>>`; a single liveness `Runner` task;
+  per-component probe over `mpsc(1)`, 5s probe timeout. A deadlocked/blocked component fails to answer
+  → marked not-live, but **is not killed or restarted** — "blocked but alive" is an observable degraded
+  state with no auto-recovery.
+- **Config reload:** no in-place hot-reload of aggregate state was found; `live_config` is read once at
+  endpoint-task init. Config change appears to be reload-by-restart (open question, §9).
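+
+The bounded-edge behavior in the first bullet can be demonstrated with the standard library's
+bounded channel. This is an analogue only: ADP uses `tokio::mpsc`, where `send().await` suspends
+the sender at the point where `try_send` below reports `Full`. The capacity of 2 is arbitrary.
+
+```rust
+use std::sync::mpsc::{sync_channel, TrySendError};
+
+/// Fills a bounded channel, then drains one slot.
+/// Returns (was the third send rejected as Full, did it succeed after a receive).
+fn demo() -> (bool, bool) {
+    let (tx, rx) = sync_channel::<u32>(2);
+    tx.try_send(1).unwrap();
+    tx.try_send(2).unwrap();
+
+    // The channel is at capacity: a blocking or awaiting send would stall here.
+    let rejected = matches!(tx.try_send(3), Err(TrySendError::Full(3)));
+
+    // The consumer makes progress, which frees one slot for the producer.
+    let first = rx.recv().unwrap();
+    assert_eq!(first, 1);
+    let accepted = tx.try_send(3).is_ok();
+
+    (rejected, accepted)
+}
+
+fn main() {
+    let (rejected, accepted) = demo();
+    println!("full while consumer idle: {}", rejected);
+    println!("accepted after one recv:  {}", accepted);
+}
+```
+
+Nothing is lost in this model: the producer simply cannot advance until the consumer does, which is
+the property the await-on-full design relies on.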
+
+## 5. Safety & liveness guarantees (Focus 4, 5) — candidate properties
+
+Claimed/observed guarantees, each a property seed (full treatment in `property-catalog.md`):
+
+**Safety (a bad thing never happens):**
+1. Backpressure, never silent inter-component drop (the await-on-full design; tested in DSD UDP path).
+2. Bounded memory — static startup bounds verification (`BoundsVerifier::verify`) rejects over-budget
+   topologies in strict mode. *(But not enforced at runtime — see §7.)*
+3. Aggregation output matches the Datadog Agent (counter→rate using bucket width as interval, half-open
+   `[start, start+width)` buckets — explicitly "to match the Datadog Agent").
+4. DDSketch relative-error guarantee: eps=1/128 (~0.78%), bin_limit=4096, bin count **must never exceed**
+   4096; merge associative/commutative.
+5. Config incompatibility is fatal at startup (high-severity incompatible non-default key → refuse to run).
+6. Graceful shutdown completes within 30s without forceful kill (in-flight data drained).
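+
+Items 3 and 4 involve simple arithmetic that a property can check directly. The sketch below shows
+half-open bucket alignment, counter-to-rate conversion using the bucket width as the interval, and
+the `eps = 1/128` figure. All function names are hypothetical; only the formulas come from the
+guarantees listed above.
+
+```rust
+use std::num::NonZeroU64;
+
+/// Start of the bucket that contains `ts` (seconds), aligned to `width`.
+fn bucket_start(ts: u64, width: NonZeroU64) -> u64 {
+    ts - ts % width.get()
+}
+
+/// Half-open membership test: [start, start + width).
+fn in_bucket(ts: u64, start: u64, width: NonZeroU64) -> bool {
+    ts >= start && ts < start + width.get()
+}
+
+/// Counter sum over one bucket, expressed as a per-second rate.
+fn counter_to_rate(sum: f64, width: NonZeroU64) -> f64 {
+    sum / width.get() as f64
+}
+
+/// Relative-error bound of the sketch as a fraction (1/128).
+fn sketch_eps() -> f64 {
+    1.0 / 128.0
+}
+
+fn main() {
+    let width = NonZeroU64::new(10).unwrap();
+    let start = bucket_start(25, width);
+    println!("bucket start for ts=25: {}", start);
+    println!("ts=29 in bucket: {}", in_bucket(29, start, width));
+    println!("ts=30 in bucket: {}", in_bucket(30, start, width));
+    println!("rate for sum=30: {}", counter_to_rate(30.0, width));
+    println!("eps as percent: {}", sketch_eps() * 100.0);
+}
+```
+
+The upper edge is the case most likely to diverge from the Agent: a sample at exactly
+`start + width` belongs to the next bucket.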
+
+**Liveness (a good thing eventually happens):**
+1. Aggregate always flushes on its interval (default 15s) regardless of input; final flush on stream close.
+2. Every passthrough/pre-timestamped metric is eventually forwarded.
+3. Topology starts accepting data only after all components report ready.
+4. Intake outage doesn't grow memory unbounded (retry queue caps) — but cap implies eventual drop on
+   prolonged outage (tension to investigate).
+5. After a transient intake outage clears, queued data is eventually delivered.
+
+## 6. Existing test strategy & coverage gaps (Focus 7) — where Antithesis adds value
+
+- **Unit tests:** dense (saluki-components ~618, saluki-core ~146, saluki-io ~102, ddsketch ~82).
+  Forwarder I/O and circuit breaker have good targeted tests but all use the `tokio::time` **virtual
+  clock** — no real interleaving/scheduling exploration.
+- **Correctness suite** (`make test-correctness`, `bin/correctness/`): **diff-testing**. `panoramic`
+  drives an **identical deterministic workload** (`millstone` load generator) into both the baseline
+  (Datadog Agent) and ADP, captures both via `datadog-intake` (mock intake), normalizes to a shared
+  `stele` representation, and diffs. Comparison is approximate (`RATIO_ERROR = 1e-8`); internal
+  telemetry filtered out; fixed `FLUSH_WAIT = 32s` after load. 21 cases, all DSD/OTLP happy-path.
+- **Integration suite** (`make test-integration`): real ADP in Docker, process-level assertions only
+  (`log_contains`, `port_listening`, `http_check`, `process_exits_with`, etc.). 27 cases (startup,
+  port binding, config-check exit codes, memory-mode exit behavior). Note: container s6 supervisor
+  **restarts ADP on exit**, so the container never actually exits.
+
+**Gaps (Antithesis's value):**
+1. **No fault injection of any kind** — grep for fault/chaos/partition/kill/crash/inject/toxiproxy/netem
+   found nothing. The intake is always healthy and reachable.
+2. **Intake down / slow / 5xx-storm never tested system-level** — retry queue, circuit breaker, disk
+   persistence, backpressure only unit-tested against in-process mocks.
+3. **No process-crash/restart mid-flow** — disk-persisted retry queue recovery never tested across a
+   real kill+restart.
+4. **No network partition / connection reset / TLS handshake failure** under steady state.
+5. **No timing/interleaving exploration** — diff testing is deterministic by design; concurrency bugs
+   in multi-endpoint fan-out, shared circuit-breaker state, JoinSet handling are invisible to it.
+6. **DogStatsD replay has zero suite coverage** despite being the newest, largest, untrusted-input feature.
+7. **Memory-pressure behavior under real load is untested** beyond boolean exit-code cases.
+8. **Config-stream drop/flap** not tested (one happy-path case `adp-config-stream`).
+
+## 7. Failure & degradation modes + unproven assumptions (Focus 8, 11) — attack surfaces
+
+These are the highest-value Antithesis targets. Several are in direct tension with the headline guarantee.
+
+1. **Memory limiting is DISABLED by default** (`MemoryMode::default() == Disabled`,
+   `saluki-app/accounting.rs:37-40`) — no bounds validation *and* no runtime limiter unless the operator
+   sets `memory_mode` + `memory_limit`. cgroup auto-detect only triggers when `DOCKER_DD_AGENT` is set.
+2. **The bounded-memory guarantee is a startup assertion, not a runtime invariant** (Wildcard). Static
+   `BoundsVerifier` sums *declared* firm limits; nothing measures actual allocation. The only runtime
+   mechanism (`MemoryLimiter`) is advisory, cooperative, ≤25ms backoff, 250ms sampling, and the aggregate
+   insert hot loop does **not** call `wait_for_capacity` — so the aggregation map + interner grow under
+   pressure regardless; only the source is throttled.
+3. **Firm bound is known-incomplete** (`aggregate/mod.rs:249-256`): a single metric with many distinct
+   timestamped values isn't accounted for. Combined with default heap-fallback in the interner, declared
+   bound and real RSS diverge arbitrarily under high-cardinality / many-timestamp load.
+4. **Limiter thread panics if RSS becomes unreadable mid-run** (`.expect`, `limiter.rs:101-102`), silently
+   removing all memory protection.
+5. **≤25ms backoff + 250ms sampling may not prevent OOM** under bursty load — directly tests "won't crash
+   under load."
+6. **Source dispatch errors are logged and swallowed** (`dogstatsd/mod.rs:1667-1716`): a mid-buffer
+   dispatch failure can mis-route remaining events (event/service-check events leaking into the metrics
+   path), and the TODO admits it will "continue to fail to dispatch … until the process is restarted."
+7. **Silent or near-silent drops:** aggregate context-limit (one warn/episode), non-finite metric values
+   (`non_finite_metric_values_are_silently_dropped`), interner-full contexts (config-dependent).
+8. **Sub-second aggregate window → panic — FIXED UPSTREAM (PR #1772):** historically
+   `bucket_width_secs = window.as_secs()` with no validation, so a value < 1s yielded `% 0`
+   divide-by-zero and `step_by(0)` panics. The key is now `aggregate_window_duration_seconds`, typed
+   `NonZeroU64`, and `bucket_width_secs` is `NonZeroU64` end-to-end (`aggregate/mod.rs:95-98,822-823`),
+   so zero/sub-second values fail config parsing rather than reaching the divisor. Retained here as a
+   closed wildcard; see `aggregate-no-panic-any-window` (now a regression tripwire).
+9. **Two-clock hazard** (Wildcard): bucketing uses **wall clock** (`get_unix_timestamp`), flush cadence
+   uses **monotonic** `tokio::interval`. A backward wall-clock jump makes the zero-value range empty
+   (silent counter gap); a forward jump floods zero-value points and allocates a large `SmallVec`. No
+   monotonicity guard. Also means a replayed capture buckets differently than when captured (the
+   aggregator ignores per-record timestamps for non-timestamped metrics).
+10. **NaN poisons a DDSketch** (`agent/sketch.rs:188-206`): `sum`/`avg` go NaN permanently; finiteness is
+    guarded per-source (DSD codec), not at the sketch boundary — fragile if a new producer is added.
+11. **All-non-finite packet → ghost metric** with a valid context but zero data points (interner/cache
+    pressure, flows downstream) rather than a dropped packet.
+12. **Replay reader treats corruption as clean EOF** (`replay/reader.rs:84-104`): a corrupt/oversized
+    length prefix silently truncates the remaining record stream — false replay-fidelity confidence.
+    ~25 unwrap/expect sites parsing untrusted capture files.
+13. **Core Agent reachability assumed at startup:** ADP blocks indefinitely on `dynamic_config.ready()`
+    with no visible timeout — if the Agent never sends config, ADP never starts the pipeline.
+14. **Hot-path panics:** numerous `.expect("… should always exist")` on default outputs, events/
+    service-check outputs, framing, sketch gamma/offset; `unreachable!("semaphore should never be
+    closed")` in pools; metrics-recorder `panic!`. Each is a crash if its invariant is violated.
+15. **UDP statsd-forward target:** on connect failure, packet forwarding is permanently disabled (no
+    retry); send errors debug-logged and dropped.
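+
+Items 8 and 9 both reduce to integer arithmetic on the bucket width and two timestamps. The sketch
+below is a simplified model, not the code in `aggregate/mod.rs`: `bucket_width` and
+`zero_value_starts` are hypothetical names, and the range-plus-`step_by` shape is an assumption
+based on the description above.
+
+```rust
+use std::num::NonZeroU64;
+use std::time::Duration;
+
+/// A sub-second window truncates to 0 seconds, which `NonZeroU64` rejects up front.
+fn bucket_width(window: Duration) -> Option<NonZeroU64> {
+    NonZeroU64::new(window.as_secs())
+}
+
+/// Bucket start times that receive a zero-value point between two flushes.
+/// Both timestamps are wall-clock seconds; nothing forces `now >= last_flush`.
+fn zero_value_starts(last_flush: u64, now: u64, width: NonZeroU64) -> Vec<u64> {
+    // Assumes a 64-bit target, so the cast cannot truncate to zero.
+    (last_flush..now).step_by(width.get() as usize).collect()
+}
+
+fn main() {
+    println!("500ms window -> {:?}", bucket_width(Duration::from_millis(500)));
+    let width = bucket_width(Duration::from_secs(10)).unwrap();
+
+    // Normal flush: one bucket elapsed.
+    println!("steady:   {}", zero_value_starts(1_000, 1_010, width).len());
+    // Backward wall-clock jump: the range is empty, so no zero-value points.
+    println!("backward: {}", zero_value_starts(1_000, 900, width).len());
+    // Forward jump of one hour: work and allocation scale with the jump.
+    println!("forward:  {}", zero_value_starts(1_000, 4_600, width).len());
+}
+```
+
+In this model a backward jump yields an empty range (the silent counter gap) and a forward jump
+yields one entry per skipped bucket, which is why a clock-fault scenario needs both directions.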
+
+## 8. Bug history & churn (Focus 6)
+
+Churn hotspots (last ~300 commits): `cli/run.rs` (wiring), `sources/dogstatsd/mod.rs` (the ~2500-line
+DSD source — biggest, most-changed file), `internal/control_plane.rs`, config-registry alignment files,
+`common/datadog/io.rs` (forwarder), metrics encoder. Notable correctness fixes (good property seeds):
+drop non-finite floats in codec; compensated summation for histograms; unitless histogram counts;
+match agent timestamped-count sampling; **moved DSD prefix/filter in front of enrich** (pipeline
+ordering bug); stabilize additional-endpoint retry-queue IDs (now load-bearing for API-key rotation).
+**Most regression-prone area: DogStatsD replay** (`e88d04b10a`, +1765 LOC, draft, validated only with
+`cargo check`, zero suite coverage, parses untrusted files). 123 TODO/FIXME/HACK markers, clustered in
+saluki-components and saluki-io; safety-relevant ones around dispatch partial-failure, disk-limit
+"bailing out," and undecided malformed-input error policy in the codecs.
+
+## 9. Assumptions & open questions
+
+- **METRIC_CONTROL relay naming:** grounding references a Remote Config METRIC_CONTROL relay; no such
+  identifier exists in the source. Config control appears to use the generic snapshot/partial config
+  stream. Confirm naming/mechanism against Confluence ("Tag Filter RC Relay Stress Test").
+- **`interconnect_capacity` default** not yet read — needed to model backpressure precisely.
+- **Config hot-reload semantics:** confirmed no in-place aggregate-state reload; is *any* runtime config
+  change applied without restart? Affects whether config-reload-mid-flight is a real attack surface.
+- **Saturated forwarder retry queue under prolonged outage:** drop vs block tension — confirm exact
+  bound and whether memory stays bounded while data is eventually shed.
+- **`persisted.rs` disk-full / partial-write / corrupt-file across crash** not deeply read (~47
+  unwrap/expect sites) — relevant to the durability/data-loss property family.
+- Discovery was read-only; no builds or tests were executed. Test *counts* come from grepping for test annotations.
+
+## 10. Implications for property selection & topology
+
+Antithesis is strongest exactly where the existing suite is blind: **degraded/down/slow intake under
+sustained load**, **process kill+restart with disk-persisted retry-queue durability**, **memory overload
+to test the soft-backpressure-only design**, **malformed/corrupt replay capture parsing**, **clock skew
+into aggregation**, and **timing/interleaving** in the forwarder and interner. The natural deployment
+mirrors the correctness harness — ADP + a controllable mock intake + a deterministic load generator —
+but adds Antithesis fault injection (network, process, clock) that the harness lacks. See
+`property-catalog.md` and `deployment-topology.md`.