Skip to content

feat(obs): Phase 1 — Backend integrity fixes (FDD-OBS-001 remediation)#30

Merged
nascimentolimaandre-cloud merged 10 commits into
mainfrom
feat/obs-001-phase1-backend-fixes
May 15, 2026
Merged

feat(obs): Phase 1 — Backend integrity fixes (FDD-OBS-001 remediation)#30
nascimentolimaandre-cloud merged 10 commits into
mainfrom
feat/obs-001-phase1-backend-fixes

Conversation

@nascimentolimaandre-cloud
Copy link
Copy Markdown
Owner

Status: DRAFT — 4 CISO must-fix items pending

This PR contains the 6 commits from Phase 1 backend integrity work (T1.1 → T1.6). CISO async review returned Conditional Approve — 4 must-fix items will land as additional commits before this is marked Ready for Review.

Commits in (Track A, pulse-engineer)

Commit Sub-task Highlight
473c561 T1.1 Add list_monitors_for_service to ObservabilityProvider Protocol (+14 tests)
65bee6d T1.3 Replace f-string SQL in _set_tenant with bound param set_config(:t, ...); future-regression audit lint (+5 tests)
810af43 T1.4 New SanitizingExceptionMiddleware (outermost); logs only ClassName + request_id; never str(exc), exc_info, __notes__ (+4 tests)
64bb479 T1.5 RISK-7 alembic 023: WITH RECURSIVE jsonb PII trigger catches at any depth + array elements (+19 integration tests)
1cbdd1a T1.6 RISK-12/13/17: Layer 4 scan widens to workers/, FORBIDDEN_SQL_COLUMNS, 4 nested pairs (+13 tests)
802a3e7 T1.2 RISK-8: scripts/rotate_obs_master_key.py + runbook; dry-run default; idempotent (+11 tests)

312 obs+infra tests passing (+66 from baseline).

Pending CISO fixes (in this PR before Ready for Review)

ID Sev Fix
FIND-001 🚨 HIGH Replace logger.exception in routes.py:125 (bypasses T1.4) with logger.error(..., type(exc).__name__) + audit other sites
FIND-003 ⚠️ MED Middleware to re-raise HTTPException (4xx) instead of swallowing into 500
FIND-004 ⚠️ MED Add contract test: SQL trigger _FORBIDDEN_KEYS ↔ Python FORBIDDEN_FIELD_NAMES
FIND-002 📝 doc Correct rotation docstring + runbook about plaintext transiting Python heap

Full review at docs/security-reviews/FDD-OBS-001-phase1-review.md.

Deferred to Phase 2 review

  • Auth gap on admin endpoints (TenantMiddleware is MVP passthrough)
  • detail=str(exc) audit in 7 places
  • Widen FORBIDDEN_SQL_COLUMNS with merged_by, requested_reviewers

🤖 Generated with Claude Code

Andre.Nascimento and others added 10 commits May 11, 2026 14:49
…ocol (FDD-OBS-001 T1.1)

PR 4a.5 shipped `list_monitors_for_service` on `DatadogProvider` and
the rollup worker calls it via the Protocol-typed argument at
`rollup_service.py:293`, but the method was never declared on the
Protocol itself. A second adapter (NR R3) without that method would
have raised `AttributeError` at runtime — only a missing line in
`base.py` separated ADR-023 from a sharp edge.

- Declare `list_monitors_for_service(service) -> list[MonitorState]`
  on the `ObservabilityProvider` Protocol in `base.py`.
- Extend the `mock_observability_provider` conftest fixture with a
  `monitors=` kwarg so spec-checked mocks expose the new method.
- Amend ADR-023 with a dated addendum documenting the addition.
- Lock the Protocol surface via a new
  `test_protocol_compliance.py` (14 tests) that asserts every
  required method is on the Protocol AND that `DatadogProvider`
  passes `isinstance(..., ObservabilityProvider)` thanks to
  `@runtime_checkable`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BS-001 T1.3)

`database._set_tenant` was using `text(f"SET app.current_tenant = '{tenant_id}'")`.
Safe in practice because every call site types `tenant_id` as UUID
and Pydantic enforces that upstream, but the pattern is H-severity
the day any caller widens the type to `str` — RLS would be one
typo away from a tenant-bypass.

- Rewrite to `SELECT set_config('app.current_tenant', :t, true)`
  with a bound parameter (`SET` itself doesn't accept binds; the
  `set_config(...)` helper is the canonical Postgres equivalent and
  honours `SET LOCAL` semantics via the third arg).
- Add `tests/unit/test_database_security.py` with 5 regression tests:
  signature stays typed UUID, source contains no `text(f"`, end-to-end
  invocation passes the bound-param dict, hostile string lands as a
  VALUE (not SQL), and an audit lint that catches new f-string SQL
  sites in `src/`.

Audit summary of OTHER `text(f"...")` sites in `src/`:
- `flow_health_on_demand.py:450` — splices module-level INT constant
  into `SET LOCAL statement_timeout`. SET LOCAL doesn't accept binds;
  value not user-controlled. LEFT AS-IS (documented).
- `pipeline/routes.py:969` — splices pre-built fragments whose only
  variable content is bound-param placeholder names (:s0, :before).
  User input flows in as bound VALUES. LEFT AS-IS (documented).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-OBS-001 T1.4)

CISO H-002 wired `hide_parameters=True` on the SQLAlchemy engine,
which covers exceptions formatted by SQLAlchemy. But asyncpg / psycopg
sometimes raise BEFORE SA wraps the exception — those raw exceptions
can carry the prepared statement text + bound-parameter VALUES, which
for `credential_service.upsert_credential` includes the pgcrypto
master key and the plaintext API key (caught in the wild during the
PR 2 AmbiguousParameterError incident).

- New `src/shared/exception_middleware.py` with
  `SanitizingExceptionMiddleware`: catches every unhandled Exception
  at the FastAPI boundary, logs ONLY the exception class name +
  method + path + a fresh request_id (no `str(exc)`, no `repr(exc)`,
  no `exc.args`, no exc_info traceback), returns an opaque 500 JSON.
- Wire it as the OUTERMOST middleware in `main.py` so it catches
  errors raised by inner middlewares (Tenant) and route handlers.
- Add `tests/unit/test_exception_middleware.py` with 4 tests:
  500 returned, opaque body, fake-secret pattern stays out of every
  log record (message + args + exc_info + exc_text), successful
  requests pass through, request_id is unique per invocation.

Defense in depth — does NOT replace `hide_parameters=True`. Both
layers must remain in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 018 installed `obs_no_pii_in_metadata()`, the Layer-2 ADR-025
guard that blocks INSERT/UPDATE on `service_squad_ownership` where the
`metadata` JSONB carries a known-PII key. The original trigger used
`NEW.metadata ? k` — top-level key existence only. Anything nested
slipped through:

    {"deep": {"nested": {"user.email": "alice@webmotors.com"}}}

silently passed. That defeats Layer 2's purpose (it's supposed to be
the safety net for Layer-1 misses).

- Migration 023 rewrites the function as a RECURSIVE walk over the
  JSONB tree using `WITH RECURSIVE` + `jsonb_each` + `jsonb_array_elements`,
  collecting every key at every depth and raising `check_violation`
  on the first match.
- Forbidden key list mirrors `FORBIDDEN_FIELD_NAMES` in
  `_anti_surveillance.py` (kept in lockstep via test).
- Applied via direct SQL on the live DB (alembic CLI has the
  documented asyncpg issue); `alembic_version` advanced from
  `022_obs_metric_monitor_health` to `023_obs_pii_trigger_recursive`.
- 19 integration tests in `tests/integration/test_obs_pii_trigger_recursive.py`:
  parametrized top-level guard (15 keys), deeply-nested case, PII
  inside JSONB array element, clean metadata passes, UPDATE path
  also blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SK-12/13/17)

Three small but real hardenings of the anti-surveillance fences.

RISK-12 — Layer 4 source scan didn't walk `src/workers/obs_*.py`.
  The obs-rollup worker is the largest moving piece of business
  logic in the bounded context; today it's clean but a future
  regression (e.g. adding `pr.author` to a rollup SQL builder)
  wouldn't be caught at PR time. Extend `_iter_observability_python_files`
  to also include `src/workers/obs_*.py`.

RISK-13 — observability code can JOIN against engineering-data
  tables (`eng_pull_requests`) where `pr.author`, `pr.merge_by`,
  `pr.reviewer`, etc. are real columns the engineering-data domain
  legitimately needs. The observability domain MUST NOT touch them.
  - New frozenset `FORBIDDEN_SQL_COLUMNS` in `_anti_surveillance.py`
    with the five named columns from the task.
  - `TestForbiddenSqlColumnsScan` greps for these references in
    every observability module (incl. workers/obs_*.py after RISK-12).

RISK-17 — Layer 1 nested PII pairs missed four shapes we've
  observed in real payloads:
  - Datadog monitor `{"creator": {"email": ..., "name": ...}}`
  - Jira metadata sometimes nests `{"modified_by": {...}}`
  - GitHub metadata uses `{"author": {"email": ...}}`
  Add `(creator,email)`, `(creator,name)`, `(modified_by,email)`,
  `(author,email)` to `FORBIDDEN_PARENT_CHILD_PAIRS`. Note: `creator`
  and `modified_by` are ALSO in `FORBIDDEN_FIELD_NAMES` (always
  dropped at top level); the pair rule is the safety net if either
  ever appears as a non-top-level nested key. `author` is only
  blocked in pair form (squad-level author names are legitimate
  metadata).

13 new test assertions in `test_obs_anti_surveillance.py`
(11 → 24 tests passing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SK-8)

Pre-customer blocker — RISK-8 in the backlog flagged "no rotation
script + no runbook" as one of the unresolved items. Without these,
an operator who suspects a leak has no documented path from
"PULSE_OBS_MASTER_KEY may be compromised" to "PULSE_OBS_MASTER_KEY
rotated cleanly". This commit ships both, with full test coverage.

- New `pulse/scripts/rotate_obs_master_key.py`:
  - Reads OLD master key from `PULSE_OBS_MASTER_KEY`, NEW from
    `PULSE_OBS_MASTER_KEY_NEW`. Both ≥ 32 chars validated.
  - Iterates `tenant_observability_credentials` row by row, each in
    its own transaction so a partial failure doesn't roll back the
    whole table.
  - For each: decrypt with OLD inside postgres (one statement, no
    plaintext crosses the wire other than as a bound parameter for
    fingerprint recomputation), re-encrypt with NEW, update
    `key_fingerprint` (unchanged since plaintext is the same),
    `last_rotated_at = NOW()`.
  - Defaults to DRY-RUN. `--apply` to actually mutate.
  - Idempotent: OLD == NEW is a logged no-op (returns 0).
  - Never logs `str(exc)` from a failed decrypt (pgcrypto wrong-key
    errors can leak bound params in some driver paths).

- New `pulse/docs/runbooks/obs-master-key-rotation.md` covering:
  - Trigger criteria (suspected leak / quarterly / turnover).
  - Pre-flight checklist (backup, worker pause, fingerprint snapshot).
  - Step-by-step rotation (dry-run, apply, env update, validation).
  - Recovery procedure for half-completed rotations.
  - Audit logging requirements.
  - The CRITICAL SAFETY RULE: the operator edits `.env` themselves;
    Claude never handles raw secret values.

- 11 tests in `tests/unit/test_rotate_obs_master_key.py`:
  - Fingerprint matches `credential_service.fingerprint` (M-005 lock).
  - Env-var validation (missing, short).
  - Live round-trip: seed row with OLD, rotate, decrypt with NEW,
    plaintext matches.
  - Dry-run preserves the row.
  - Idempotence when OLD == NEW.

Total obs+infra tests: 312 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…FIND-001)

logger.exception attaches exc_info=True under the hood, which serializes
the full traceback including local variable bindings and chained
exceptions via __traceback__. Driver-level exceptions (asyncpg) can
carry bound parameter values — exactly what T1.4's middleware was built
to close. But both call sites are INSIDE handlers and never reach the
middleware.

Audit found 2 sites (not 1 as CISO listed):
- routes.py:125 inside validate_datadog_credential
- obs_rollup_worker.py:76 inside _run_one_cycle wrapper

Both replaced with logger.error("... err_class=%s", type(exc).__name__).
New regression test fails if logger.exception sneaks back in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SO FIND-003)

Catching bare Exception swallowed FastAPI's HTTPException (and its
Starlette parent), converting every legitimate 4xx into an opaque 500.
That kills Pydantic 422s, InvalidSiteError 422, get_provider_metadata
404s, and rotation 503s — all of which clients are supposed to read.

Fix: re-raise StarletteHTTPException explicitly before the sanitize
branch. FastAPI's HTTPException subclasses it, so one catch handles both.
The Exception branch keeps doing its job for genuine driver/unexpected
errors. +2 tests in test_exception_middleware.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The migration 023_obs_pii_trigger_recursive.py and the Python
_anti_surveillance.FORBIDDEN_FIELD_NAMES are kept in lockstep by
the comment in both files — but no test enforces it. They match
today; the moment someone updates one side, the layers silently
diverge. New test parses the migration's _FORBIDDEN_KEYS tuple
via ast and asserts set-equality with the Python frozenset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…002)

The module docstring claimed "Plaintext decrypt happens INSIDE postgres
... in a single transaction" — misleading. The script DOES decrypt
inside postgres, but the plaintext returns to Python heap to compute
the M-005 fingerprint before being re-encrypted in a second statement.
Two separate transactions per row, plaintext transits Python briefly.

The script is correct; the documentation lied about the threat model.
Rewrites the docstring + adds a "What rotation does (and the threat
model)" section to the runbook with the honest 5-step data flow +
explicit residence/risk statement. Adds `del drow` after the UPDATE
to make the plaintext residence window explicit in code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nascimentolimaandre-cloud nascimentolimaandre-cloud marked this pull request as ready for review May 15, 2026 18:50
@nascimentolimaandre-cloud nascimentolimaandre-cloud merged commit 4bef55a into main May 15, 2026
4 checks passed
@nascimentolimaandre-cloud nascimentolimaandre-cloud deleted the feat/obs-001-phase1-backend-fixes branch May 15, 2026 19:30
nascimentolimaandre-cloud pushed a commit that referenced this pull request May 15, 2026
Five files were authored in the OneDrive clone before the dev environment
migrated to ~/Code and never made it into git:

- docs/security-reviews/FDD-OBS-001-phase1-review.md
    CISO's Phase 1 review (Conditional Approve, 4 findings) — the
    document that drove PR #30's 4 must-fix commits. Important
    historical record of the review.

- docs/ux-specs/_pulse-web-component-gap.md
- docs/ux-specs/_pulse-web-component-inventory.md
    pulse-ux-reviewer audit of pulse-web's current component coverage
    vs. what FDD-OBS-001 Phase 3+ will need — hand-off to pulse-engineer.

- docs/ux-specs/observability-aliases-impl-spec.md   (Phase 4 UI spec)
- docs/ux-specs/observability-ownership-impl-spec.md (Phase 3 UI spec)
    pulse-ux-reviewer impl specs for the React surfaces the next phases
    will componentize against the design system.

No code changes — docs-only. Closes the OneDrive → ~/Code migration gap.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant