Skip to content

Commit e07b20b

Browse files
etrclaude
andcommitted
TASK-080: tighten threadsafety_stress latency gate from 100x p99 to 20x p95
Restores meaningful regression bite on the adversarial_segments registration latency gate (sub-test C of threadsafety_stress) without flaking on CI scheduler noise. Stabilisation stack applied: * Per-thread sample buffers eliminate the previous shared std::mutex-guarded std::vector<int64_t>::push_back from the timing window. Lock-wait jitter from prior iterations no longer leaks into subsequent samples' cache lines. * New HTTPSERVER_STRESS_PIN_CPU env knob (Linux only) pins the four writer threads to a single CPU via pthread_setaffinity_np. Counter- intuitive but correct: writers serialise on route_table_mutex_ anyway, and single-CPU placement eliminates cross-CPU cache misses on radix-tree node memory. Off by default; macOS / Windows: no-op. * New HTTPSERVER_STRESS_REPEATS env knob drives N back-to-back sampling rounds per test invocation, printing one [STATS] line per round. Used to build per-lane noise-floor CDFs without needing to rerun the full make-check N times. Capped at 200. When N>1 the gate is checked against the worst-observed p95 across all rounds. Gate redesign: * Statistic switched from p99 to p95. p99 = top 150 samples on a 15 000-op run; a single 1 ms OS-scheduler preemption spike against a ~10 us median gives a 100x ratio that is purely environmental. p95 = top 750 samples and is robust against single spikes while still catching algorithmic regressions (an O(n) traversal at 15k items would shift the entire upper quartile). p99 is still printed in the [STATS] line as a forensic diagnostic. * Threshold ratio lowered from 100x to 20x. Even with the stabilisation stack in place, the dominant noise floor on a quiet Apple Silicon laptop is legitimate route_table_mutex_ contention, not OS noise. 10x is infeasible without rewriting the registration lock strategy (out of scope). 20x gives ~50% headroom over the worst observed local 10-round sweep (13.4x), is still 5x tighter than the pre-TASK-080 gate, and restores real bite against algorithmic regressions. Documentation: * test/PERFORMANCE.md gains a "Methodology - threadsafety_stress adversarial_segments latency gate" section with the gate table, the per-statistic noise rationale (why p95), the per-ratio rationale (why 20x), and the HTTPSERVER_STRESS_REPEATS workflow for re-measuring on a flaky CI lane. * Inline test comment updated to point at the methodology section and record both design choices (statistic switch, threshold choice) with their measurement evidence. Acceptance criterion notes: * The TASK-080 "no flake in 50 CI runs across the matrix" criterion cannot be enforced at PR-time. Proxy used: local 10-round sweep (worst p95 ratio 13.4x against the 20x gate, ~50% headroom). Post-merge monitoring window covered in PERFORMANCE.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 52a06e0 commit e07b20b

2 files changed

Lines changed: 433 additions & 121 deletions

File tree

test/PERFORMANCE.md

Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -149,6 +149,115 @@ padding` ≈ map_size minus ~8 bytes. The corrected algebra above
149149
captures the actual contract ("the map went away") without
150150
papering over the new field's cost.
151151
152+
## Methodology — `threadsafety_stress` adversarial_segments latency gate
153+
154+
### What this gate measures
155+
156+
`test/integ/threadsafety_stress.cpp` sub-test C
157+
(`adversarial_segments_registration_no_latency_spike`) hammers the
158+
`webserver::register_path` mutating API with an adversarial corpus of
159+
15 000 sibling path segments distributed across 3 parent prefixes,
160+
driven by 4 contending writer threads. The corpus shape (24-byte common
161+
prefix, 8-byte discriminating tail) is the worst case for
162+
`std::map<std::string>::find` per-probe cost in the radix tree's
163+
per-segment child container, so per-op insert cost will surface any
164+
algorithmic regression (e.g. a future refactor that drops back to O(n)
165+
sibling scan).
166+
167+
The gate encodes the PRD §3.6 / TASK-056 "no dispatch latency spikes
168+
> 10× baseline" criterion as a deterministic ratio assertion against
169+
the warmup-window median.
170+
171+
### Gate
172+
173+
| Parameter | Value | Rationale |
174+
|---|---|---|
175+
| Statistic | p95 of overall samples vs median of first quarter | p99 too sensitive to OS preemption on shared CI |
176+
| Threshold ratio | < 20× warmup median | TASK-080 noise-floor study (see below) |
177+
| Baseline floor clamp | warmup_median = max(actual, 1 µs) | Prevents degenerate gate when timer quantises |
178+
179+
### Stabilisation techniques in effect
180+
181+
| Technique | Status | Notes |
182+
|---|---|---|
183+
| Per-thread sample buffers (no hot-path lock) | adopted | Eliminates `samples_mtx` from the timing window; the previous global `std::mutex`-guarded `push_back` leaked prior-iteration lock-wait jitter into the next sample's cache lines |
184+
| Linux CPU pinning of writer threads | optional, off by default | `HTTPSERVER_STRESS_PIN_CPU=N` pins all 4 writers to CPU N via `pthread_setaffinity_np`. Counter-intuitively single-CPU pinning is correct here — writers serialise on `route_table_mutex_` regardless, so single-CPU placement eliminates cross-CPU cache misses on radix-tree node memory. macOS / Windows: no-op |
185+
| Statistic switch p99 → p95 | adopted | See "Why p95, not p99" below |
186+
| Top-N% trimming | rejected | "Trim before gate" is unprincipled and looks like hiding regressions. Switching the statistic (p99 → p95) is principled — the gate now uses a more robust order statistic, not censored data |
187+
| `__rdtsc` high-precision timer | rejected | `std::chrono::steady_clock` (≈20 ns resolution on Linux, ≈40 ns on macOS) is fine for 10-100+ µs samples; TSC drift across cores is not worth the portability cost |
188+
189+
### Why p95, not p99
190+
191+
p99 on a 15 000-sample run = top 150 samples. A single 1 ms OS-scheduler
192+
preemption spike (kernel tick, neighbour-process scheduling, page-fault
193+
servicing) against a ~10 µs median produces a 100× ratio that is purely
194+
environmental — not a property of the algorithm under test. p95 = top
195+
750 samples and is robust against that: an O(n) algorithmic regression
196+
at 15k items would shift the entire upper quartile (p95 included); a
197+
single preemption spike does not.
198+
199+
p99 is still printed in the `[STATS]` diagnostic line for forensic use.
200+
201+
### Why 20×, not 10×
202+
203+
The TASK-080 stabilisation stack reduces but does NOT eliminate the
204+
noise floor. The dominant residual contributor is **legitimate
205+
contention on `route_table_mutex_`**, not OS noise: 4 writer threads
206+
serialise on a single std::mutex around the radix-tree insert, and the
207+
top 5% of samples are precisely the lock-wait queue tail.
208+
209+
| Sweep | Worst observed p95/warmup_median ratio | Notes |
210+
|---|---|---|
211+
| TASK-080 measurement, Apple Silicon (M-series), `-O3 -DNDEBUG`, `HTTPSERVER_STRESS_REPEATS=10`, no pinning | 13.4× | Quiet laptop, no other tenants |
212+
| Pre-TASK-080 baseline (with `samples_mtx` in hot path) | similar p95, larger p99 spread | Per-thread buffers tighten p99 more than p95 |
213+
214+
10× is therefore genuinely infeasible without rewriting the
215+
registration locking strategy (out of scope for TASK-080). 20× gives
216+
~50% headroom over the worst observed local round and is still **5×
217+
tighter than the pre-TASK-080 gate of 100× p99** — restoring real
218+
regression bite against algorithmic regressions (an accidental O(n)
219+
traversal at 15k items would push p95 to >100× the baseline).
220+
221+
### How to re-measure
222+
223+
Run the test with `HTTPSERVER_STRESS_REPEATS=N` to drive N back-to-back
224+
sampling rounds within a single test invocation. Each round prints a
225+
`[STATS]` line; the gate is checked against the worst-observed p95
226+
across rounds.
227+
228+
```sh
229+
# Single-shot diagnostic on the current host
230+
cd build
231+
HTTPSERVER_STRESS_SECONDS=15 HTTPSERVER_STRESS_REPEATS=20 \
232+
./test/threadsafety_stress 2>&1 | grep STATS
233+
234+
# Linux: with CPU pinning
235+
HTTPSERVER_STRESS_SECONDS=15 HTTPSERVER_STRESS_REPEATS=20 \
236+
HTTPSERVER_STRESS_PIN_CPU=0 ./test/threadsafety_stress 2>&1 | grep STATS
237+
238+
# Aggregate the [STATS] lines to compute per-lane p95/baseline CDFs.
239+
# The relevant fields are warmup_median, p95, p99, and p95_ratio
240+
# (printed in percent of warmup median, so 1300% = 13×).
241+
```
242+
243+
### Acceptance criterion verification — 50-run stability
244+
245+
The TASK-080 acceptance criterion "test has not flaked in the last 50
246+
CI runs across the matrix" cannot be enforced at PR-time (a PR has 1
247+
run per lane, not 50). The proxy used at merge:
248+
249+
1. Local 10-round sweep on the maintainer's reference host (Apple
250+
Silicon, `-O3 -DNDEBUG`, no pinning) — worst observed p95 ratio
251+
13.4× against the 20× gate (gate margin: ~50%).
252+
2. Post-merge monitoring window: any flake of this test on
253+
`feature/v2.0` CI within the first week of merge is grounds for
254+
re-opening TASK-080 and re-running the noise-floor sweep on the
255+
flaking lane.
256+
257+
If a CI flake surfaces post-merge, capture the `[STATS]` line from the
258+
failing job logs, then re-run locally on the same lane shape with
259+
`HTTPSERVER_STRESS_REPEATS=50` to characterise the new noise floor.
260+
152261
## How to re-run on this branch
153262

154263
```sh

0 commit comments

Comments
 (0)