feat(obs): Phase 1 — Backend integrity fixes (FDD-OBS-001 remediation)#30
Merged
nascimentolimaandre-cloud merged 10 commits intoMay 15, 2026
Merged
Conversation
…ocol (FDD-OBS-001 T1.1) PR 4a.5 shipped `list_monitors_for_service` on `DatadogProvider` and the rollup worker calls it via the Protocol-typed argument at `rollup_service.py:293`, but the method was never declared on the Protocol itself. A second adapter (NR R3) without that method would have raised `AttributeError` at runtime — only a missing line in `base.py` separated ADR-023 from a sharp edge. - Declare `list_monitors_for_service(service) -> list[MonitorState]` on the `ObservabilityProvider` Protocol in `base.py`. - Extend the `mock_observability_provider` conftest fixture with a `monitors=` kwarg so spec-checked mocks expose the new method. - Amend ADR-023 with a dated addendum documenting the addition. - Lock the Protocol surface via a new `test_protocol_compliance.py` (14 tests) that asserts every required method is on the Protocol AND that `DatadogProvider` passes `isinstance(..., ObservabilityProvider)` thanks to `@runtime_checkable`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BS-001 T1.3)
`database._set_tenant` was using `text(f"SET app.current_tenant = '{tenant_id}'")`.
Safe in practice because every call site types `tenant_id` as UUID
and Pydantic enforces that upstream, but the pattern is H-severity
the day any caller widens the type to `str` — RLS would be one
typo away from a tenant-bypass.
- Rewrite to `SELECT set_config('app.current_tenant', :t, true)`
with a bound parameter (`SET` itself doesn't accept binds; the
`set_config(...)` helper is the canonical Postgres equivalent and
honours `SET LOCAL` semantics via the third arg).
- Add `tests/unit/test_database_security.py` with 5 regression tests:
signature stays typed UUID, source contains no `text(f"`, end-to-end
invocation passes the bound-param dict, hostile string lands as a
VALUE (not SQL), and an audit lint that catches new f-string SQL
sites in `src/`.
Audit summary of OTHER `text(f"...")` sites in `src/`:
- `flow_health_on_demand.py:450` — splices module-level INT constant
into `SET LOCAL statement_timeout`. SET LOCAL doesn't accept binds;
value not user-controlled. LEFT AS-IS (documented).
- `pipeline/routes.py:969` — splices pre-built fragments whose only
variable content is bound-param placeholder names (:s0, :before).
User input flows in as bound VALUES. LEFT AS-IS (documented).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-OBS-001 T1.4) CISO H-002 wired `hide_parameters=True` on the SQLAlchemy engine, which covers exceptions formatted by SQLAlchemy. But asyncpg / psycopg sometimes raise BEFORE SA wraps the exception — those raw exceptions can carry the prepared statement text + bound-parameter VALUES, which for `credential_service.upsert_credential` includes the pgcrypto master key and the plaintext API key (caught in the wild during the PR 2 AmbiguousParameterError incident). - New `src/shared/exception_middleware.py` with `SanitizingExceptionMiddleware`: catches every unhandled Exception at the FastAPI boundary, logs ONLY the exception class name + method + path + a fresh request_id (no `str(exc)`, no `repr(exc)`, no `exc.args`, no exc_info traceback), returns an opaque 500 JSON. - Wire it as the OUTERMOST middleware in `main.py` so it catches errors raised by inner middlewares (Tenant) and route handlers. - Add `tests/unit/test_exception_middleware.py` with 4 tests: 500 returned, opaque body, fake-secret pattern stays out of every log record (message + args + exc_info + exc_text), successful requests pass through, request_id is unique per invocation. Defense in depth — does NOT replace `hide_parameters=True`. Both layers must remain in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 018 installed `obs_no_pii_in_metadata()`, the Layer-2 ADR-025
guard that blocks INSERT/UPDATE on `service_squad_ownership` where the
`metadata` JSONB carries a known-PII key. The original trigger used
`NEW.metadata ? k` — top-level key existence only. Anything nested
slipped through:
{"deep": {"nested": {"user.email": "alice@webmotors.com"}}}
silently passed. That defeats Layer 2's purpose (it's supposed to be
the safety net for Layer-1 misses).
- Migration 023 rewrites the function as a RECURSIVE walk over the
JSONB tree using `WITH RECURSIVE` + `jsonb_each` + `jsonb_array_elements`,
collecting every key at every depth and raising `check_violation`
on the first match.
- Forbidden key list mirrors `FORBIDDEN_FIELD_NAMES` in
`_anti_surveillance.py` (kept in lockstep via test).
- Applied via direct SQL on the live DB (alembic CLI has the
documented asyncpg issue); `alembic_version` advanced from
`022_obs_metric_monitor_health` to `023_obs_pii_trigger_recursive`.
- 19 integration tests in `tests/integration/test_obs_pii_trigger_recursive.py`:
parametrized top-level guard (15 keys), deeply-nested case, PII
inside JSONB array element, clean metadata passes, UPDATE path
also blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SK-12/13/17)
Three small but real hardenings of the anti-surveillance fences.
RISK-12 — Layer 4 source scan didn't walk `src/workers/obs_*.py`.
The obs-rollup worker is the largest moving piece of business
logic in the bounded context; today it's clean but a future
regression (e.g. adding `pr.author` to a rollup SQL builder)
wouldn't be caught at PR time. Extend `_iter_observability_python_files`
to also include `src/workers/obs_*.py`.
RISK-13 — observability code can JOIN against engineering-data
tables (`eng_pull_requests`) where `pr.author`, `pr.merge_by`,
`pr.reviewer`, etc. are real columns the engineering-data domain
legitimately needs. The observability domain MUST NOT touch them.
- New frozenset `FORBIDDEN_SQL_COLUMNS` in `_anti_surveillance.py`
with the five named columns from the task.
- `TestForbiddenSqlColumnsScan` greps for these references in
every observability module (incl. workers/obs_*.py after RISK-12).
RISK-17 — Layer 1 nested PII pairs missed four shapes we've
observed in real payloads:
- Datadog monitor `{"creator": {"email": ..., "name": ...}}`
- Jira metadata sometimes nests `{"modified_by": {...}}`
- GitHub metadata uses `{"author": {"email": ...}}`
Add `(creator,email)`, `(creator,name)`, `(modified_by,email)`,
`(author,email)` to `FORBIDDEN_PARENT_CHILD_PAIRS`. Note: `creator`
and `modified_by` are ALSO in `FORBIDDEN_FIELD_NAMES` (always
dropped at top level); the pair rule is the safety net if either
ever appears as a non-top-level nested key. `author` is only
blocked in pair form (squad-level author names are legitimate
metadata).
13 new test assertions in `test_obs_anti_surveillance.py`
(11 → 24 tests passing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SK-8)
Pre-customer blocker — RISK-8 in the backlog flagged "no rotation
script + no runbook" as one of the unresolved items. Without these,
an operator who suspects a leak has no documented path from
"PULSE_OBS_MASTER_KEY may be compromised" to "PULSE_OBS_MASTER_KEY
rotated cleanly". This commit ships both, with full test coverage.
- New `pulse/scripts/rotate_obs_master_key.py`:
- Reads OLD master key from `PULSE_OBS_MASTER_KEY`, NEW from
`PULSE_OBS_MASTER_KEY_NEW`. Both ≥ 32 chars validated.
- Iterates `tenant_observability_credentials` row by row, each in
its own transaction so a partial failure doesn't roll back the
whole table.
- For each: decrypt with OLD inside postgres (one statement, no
plaintext crosses the wire other than as a bound parameter for
fingerprint recomputation), re-encrypt with NEW, update
`key_fingerprint` (unchanged since plaintext is the same),
`last_rotated_at = NOW()`.
- Defaults to DRY-RUN. `--apply` to actually mutate.
- Idempotent: OLD == NEW is a logged no-op (returns 0).
- Never logs `str(exc)` from a failed decrypt (pgcrypto wrong-key
errors can leak bound params in some driver paths).
- New `pulse/docs/runbooks/obs-master-key-rotation.md` covering:
- Trigger criteria (suspected leak / quarterly / turnover).
- Pre-flight checklist (backup, worker pause, fingerprint snapshot).
- Step-by-step rotation (dry-run, apply, env update, validation).
- Recovery procedure for half-completed rotations.
- Audit logging requirements.
- The CRITICAL SAFETY RULE: the operator edits `.env` themselves;
Claude never handles raw secret values.
- 11 tests in `tests/unit/test_rotate_obs_master_key.py`:
- Fingerprint matches `credential_service.fingerprint` (M-005 lock).
- Env-var validation (missing, short).
- Live round-trip: seed row with OLD, rotate, decrypt with NEW,
plaintext matches.
- Dry-run preserves the row.
- Idempotence when OLD == NEW.
Total obs+infra tests: 312 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…FIND-001)
logger.exception attaches exc_info=True under the hood, which serializes
the full traceback including local variable bindings and chained
exceptions via __traceback__. Driver-level exceptions (asyncpg) can
carry bound parameter values — exactly what T1.4's middleware was built
to close. But both call sites are INSIDE handlers and never reach the
middleware.
Audit found 2 sites (not 1 as CISO listed):
- routes.py:125 inside validate_datadog_credential
- obs_rollup_worker.py:76 inside _run_one_cycle wrapper
Both replaced with logger.error("... err_class=%s", type(exc).__name__).
New regression test fails if logger.exception sneaks back in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SO FIND-003) Catching bare Exception swallowed FastAPI's HTTPException (and its Starlette parent), converting every legitimate 4xx into an opaque 500. That kills Pydantic 422s, InvalidSiteError 422, get_provider_metadata 404s, and rotation 503s — all of which clients are supposed to read. Fix: re-raise StarletteHTTPException explicitly before the sanitize branch. FastAPI's HTTPException subclasses it, so one catch handles both. The Exception branch keeps doing its job for genuine driver/unexpected errors. +2 tests in test_exception_middleware.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The migration 023_obs_pii_trigger_recursive.py and the Python _anti_surveillance.FORBIDDEN_FIELD_NAMES are kept in lockstep by the comment in both files — but no test enforces it. They match today; the moment someone updates one side, the layers silently diverge. New test parses the migration's _FORBIDDEN_KEYS tuple via ast and asserts set-equality with the Python frozenset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…002) The module docstring claimed "Plaintext decrypt happens INSIDE postgres ... in a single transaction" — misleading. The script DOES decrypt inside postgres, but the plaintext returns to Python heap to compute the M-005 fingerprint before being re-encrypted in a second statement. Two separate transactions per row, plaintext transits Python briefly. The script is correct; the documentation lied about the threat model. Rewrites the docstring + adds a "What rotation does (and the threat model)" section to the runbook with the honest 5-step data flow + explicit residence/risk statement. Adds `del drow` after the UPDATE to make the plaintext residence window explicit in code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nascimentolimaandre-cloud
pushed a commit
that referenced
this pull request
May 15, 2026
Five files were authored in the OneDrive clone before the dev environment
migrated to ~/Code and never made it into git:
- docs/security-reviews/FDD-OBS-001-phase1-review.md
CISO's Phase 1 review (Conditional Approve, 4 findings) — the
document that drove PR #30's 4 must-fix commits. Important
historical record of the review.
- docs/ux-specs/_pulse-web-component-gap.md
- docs/ux-specs/_pulse-web-component-inventory.md
pulse-ux-reviewer audit of pulse-web's current component coverage
vs. what FDD-OBS-001 Phase 3+ will need — hand-off to pulse-engineer.
- docs/ux-specs/observability-aliases-impl-spec.md (Phase 4 UI spec)
- docs/ux-specs/observability-ownership-impl-spec.md (Phase 3 UI spec)
pulse-ux-reviewer impl specs for the React surfaces the next phases
will componentize against the design system.
No code changes — docs-only. Closes the OneDrive → ~/Code migration gap.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: DRAFT — 4 CISO must-fix items pending
This PR contains the 6 commits from Phase 1 backend integrity work (T1.1 → T1.6). CISO async review returned Conditional Approve — 4 must-fix items will land as additional commits before this is marked Ready for Review.
Commits in (Track A, pulse-engineer)
473c561list_monitors_for_serviceto ObservabilityProvider Protocol (+14 tests)65bee6d_set_tenantwith bound paramset_config(:t, ...); future-regression audit lint (+5 tests)810af43SanitizingExceptionMiddleware(outermost); logs only ClassName + request_id; neverstr(exc),exc_info,__notes__(+4 tests)64bb479WITH RECURSIVEjsonb PII trigger catches at any depth + array elements (+19 integration tests)1cbdd1a802a3e7scripts/rotate_obs_master_key.py+ runbook; dry-run default; idempotent (+11 tests)312 obs+infra tests passing (+66 from baseline).
Pending CISO fixes (in this PR before Ready for Review)
logger.exceptioninroutes.py:125(bypasses T1.4) withlogger.error(..., type(exc).__name__)+ audit other sitesHTTPException(4xx) instead of swallowing into 500_FORBIDDEN_KEYS↔ PythonFORBIDDEN_FIELD_NAMESFull review at
docs/security-reviews/FDD-OBS-001-phase1-review.md.Deferred to Phase 2 review
detail=str(exc)audit in 7 placesmerged_by,requested_reviewers🤖 Generated with Claude Code