1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@ the first consumer-visible behaviour change and will drive the next SDK version
- **Automation-detector wording**: the anti-tamper flag now reads "Anti-tamper signals" (not "Automation detected") when the combined confidence is weak — e.g. devtools open in a dev environment — so a human isn't labelled a bot. The machine-readable `code` (`automation_suspected`) is unchanged; the reason text also drops the `tamper.` prefix for readability.

### Internal
- **Error tracking (Sentry)** ([ADR-0006](docs/adr/0006-observability-sentry.md)): the hosted server was *instrumented but blind* — OpenTelemetry was wired but ships disabled (`OTEL_SDK_DISABLED=true`) and `pino` logs are ephemeral, so prod errors and outages went unseen. Added `@sentry/node` (v10) error capture + alerting for the server and worker. A new `instrument.ts` runs `Sentry.init` (preloaded via `--import` before app modules) and **no-ops without `SENTRY_DSN`** (`${SCENT_SECRET_KEY:-}`-style "env unset = disabled"), so dev/test/self-host stay inert. Express errors are caught via `setupExpressErrorHandler`; BullMQ job failures via explicit `captureException` in the worker `failed` handlers (+ `flush` on shutdown). Privacy posture for this PII-sensitive product: **EU-region project + strict scrubbing** (`sendDefaultPii: false` plus an exported, unit-tested `scrubPii` `beforeSend` that strips request bodies, cookies, the `x-api-key`/`cookie`/`authorization` headers, query string, and client IP). Errors-only by default (`tracesSampleRate` 0, env-overridable). New env `SENTRY_DSN`/`SENTRY_ENVIRONMENT`/`SENTRY_RELEASE`/`SENTRY_TRACES_SAMPLE_RATE` in the deploy `.env.example`, both compose services, and the runbook (with an external `/health` uptime-monitor note). Distributed traces/metrics/off-box logs to a managed backend remain a deferred phase 2 (the OTel wiring already exists).
- **Organizations (multi-tenant) layer** ([ADR-0005](docs/adr/0005-organizations-and-tenancy.md), migrations 013–014): a new `organizations` table is now the tenant boundary above `projects`. The admin `owner` role is **re-scoped from a global superuser to org-scoped** — `canViewProject`/`canManageProject`, the `/admin/*` listing queries, and `requireProjectRead` filter by `organization_id`, and a cross-org project/user id returns `404` (no existence leak). 2FA policy moved from the install-wide `admin_settings.require_2fa` to per-org `organizations.require_2fa`; invites carry the inviter's org so an accepted account joins that company. Migration 013 backfills a single `Default` org for existing installs, so **self-host is unchanged** (one auto-created org); the `NOT NULL` FK backstop lands in migration 014 once every writer is org-aware. `create-admin`/`create-project` take an optional `[orgName]` (shared `findOrCreateOrgByName`). `organization_id` stays off the `/v1` data path (a project key already scopes data) — orgs are an admin/billing concern, the foundational prerequisite for hosted metering/billing. Public self-serve signup is deferred to the billing workstream.
- **Docs: GDPR & consent** ([ADR-0004](docs/adr/0004-consent-and-data-lifecycle.md)): new [GDPR & consent integration guide](docs/integrations/gdpr-consent.md) (controller/processor split, CMP wiring per mode, lawful-basis guidance, data-subject rights, DPA stub); OpenAPI updated with the snapshot consent fields, the `LawfulBasis` schema, and the `DELETE`/`export` identity paths.
- **Retention sweeper** ([ADR-0004](docs/adr/0004-consent-and-data-lifecycle.md)): a daily BullMQ repeatable job in the worker deletes identities (and, by cascade, their snapshots/drifts/risk/links) whose `last_seen` is older than their project's `retention_days`; projects with a null `retention_days` keep data indefinitely. `sweepRetention()` is idempotent and unit-tested against a real DB.
13 changes: 13 additions & 0 deletions deploy/.env.example
@@ -20,6 +20,19 @@ SCENT_SECRET_KEY=change-me
# Leave blank for an API-only deploy (server-to-server traffic sends no Origin).
CORS_ALLOWED_ORIGINS=

# Error tracking (Sentry). Unset = disabled (no events leave the box). The DSN is a
# write-only ingest key, not a secret of the cloud-token class — safe to paste here.
# IMPORTANT: create the Sentry project in the EU region (this is a PII-sensitive
# product); the server scrubs request bodies, cookies, and auth headers before sending
# (see deploy/README.md and docs/adr/0006-observability-sentry.md).
SENTRY_DSN=
# Environment tag shown in Sentry. Defaults to NODE_ENV (production here) if unset.
SENTRY_ENVIRONMENT=production
# Release identifier for grouping/regressions — set to the image tag or git SHA.
SENTRY_RELEASE=
# Performance-trace sample rate (0.0–1.0). Default 0 = errors only, no perf traces.
SENTRY_TRACES_SAMPLE_RATE=0

# Image tag to run. Default "latest"; pin to a commit SHA for reproducible deploys.
SCENT_IMAGE_TAG=latest

28 changes: 28 additions & 0 deletions deploy/README.md
@@ -115,6 +115,34 @@ persistence is enabled here). Combined with the `event_id` dedupe in the worker,
that gives at-least-once processing across restarts. For stronger guarantees a
Postgres outbox would be the next step.

## Observability: error tracking + uptime

Out of the box the server emits structured `pino` logs (visible via `docker compose logs
-f scent-server`) but they are ephemeral, and there is no alerting. Two low-effort layers
close that gap.

**Error tracking (Sentry).** Set `SENTRY_DSN` in `.env` and the server + worker report
unhandled errors (with stack traces and request/job context) to Sentry; leave it unset and
the SDK stays completely inert (no events leave the box). Setup:

1. Create the Sentry organization and project **in the EU data region** (the region is
chosen when the organization is created). This is a PII-sensitive product, so keep error data in the EU.
2. Copy the project's DSN into `.env` as `SENTRY_DSN=` and optionally set `SENTRY_RELEASE`
to the image tag/SHA you're running. `docker compose pull && docker compose up -d`.
3. In Sentry, add an **alert rule** (e.g. notify on a new issue / error-rate spike).

The DSN is a write-only ingest key, **not** a cloud/API token — safe to keep in `.env`.
Before any event is sent the server scrubs request bodies (POST `/v1/events` carries raw
fingerprint signals = PII), cookies, the `x-api-key`/`cookie`/`authorization` headers, the
query string, and the client IP (`sendDefaultPii: false` plus an explicit `beforeSend` —
see [docs/adr/0006-observability-sentry.md](../docs/adr/0006-observability-sentry.md)).
Distributed traces/metrics to a managed backend are a deferred phase 2 (the OTel wiring
already exists but ships disabled via `OTEL_SDK_DISABLED=true`).

**Uptime.** Sentry can't tell you the box is hard-down. Point an external monitor (Better
Stack / UptimeRobot free tier) at `https://<your-domain>/health` — it returns
`{"status":"ok",...}` — with a 1-minute interval and alerting to the same channel.
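
For operators who would rather script the check than use a hosted monitor, the probe described above can be sketched in TypeScript. This is a minimal sketch, not part of the server: `evaluateHealth` and `probeHealth` are hypothetical names, and the only facts taken from this runbook are the `/health` path and the `{"status":"ok",...}` body. It assumes Node 18+ for the global `fetch` and `AbortSignal.timeout`.

```typescript
// Sketch only: `evaluateHealth` and `probeHealth` are hypothetical helper names.
interface HealthResult {
  healthy: boolean;
  reason: string;
}

// Pure decision step: HTTP 200 plus a JSON body whose `status` field is "ok".
function evaluateHealth(statusCode: number, bodyText: string): HealthResult {
  if (statusCode !== 200) {
    return { healthy: false, reason: `unexpected HTTP status ${statusCode}` };
  }
  let body: unknown;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return { healthy: false, reason: 'body is not valid JSON' };
  }
  const status =
    typeof body === 'object' && body !== null
      ? (body as { status?: unknown }).status
      : undefined;
  return status === 'ok'
    ? { healthy: true, reason: 'ok' }
    : { healthy: false, reason: 'status field is not "ok"' };
}

// Network step: any connection error or timeout is treated as "down".
async function probeHealth(url: string, timeoutMs = 5000): Promise<HealthResult> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return evaluateHealth(res.status, await res.text());
  } catch (err) {
    return { healthy: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```

Keeping the decision step separate from the network step means the pass/fail rule can be unit-tested without a running server. A probe like this only helps if it runs somewhere other than the box it monitors.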

## Optional: GeoIP (impossible-travel detection)

The `impossible_transition` risk flag — IP geolocation moving faster than a flight
10 changes: 9 additions & 1 deletion deploy/docker-compose.yml
@@ -42,6 +42,10 @@ services:
CORS_ALLOWED_ORIGINS: ${CORS_ALLOWED_ORIGINS:-}
SCENT_SECRET_KEY: ${SCENT_SECRET_KEY:-}
OTEL_SDK_DISABLED: "true"
SENTRY_DSN: ${SENTRY_DSN:-}
SENTRY_ENVIRONMENT: ${SENTRY_ENVIRONMENT:-production}
SENTRY_RELEASE: ${SENTRY_RELEASE:-}
SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-0}
healthcheck:
test: ["CMD", "node", "-e", "require('http').get('http://localhost:3000/health',r=>process.exit(r.statusCode===200?0:1)).on('error',()=>process.exit(1))"]
interval: 15s
@@ -59,13 +63,17 @@ services:
restart: unless-stopped
# Same image, different entrypoint: drains the BullMQ ingest queue. Scale with
# `docker compose up -d --scale scent-worker=N` (no ports/name to collide).
command: ["node", "--import", "./dist/tracing.js", "dist/worker.js"]
command: ["node", "--import", "./dist/instrument.js", "--import", "./dist/tracing.js", "dist/worker.js"]
environment:
DATABASE_URL: postgresql://scent:${POSTGRES_PASSWORD}@postgres:5432/scent
REDIS_URL: redis://redis:6379
NODE_ENV: production
WORKER_CONCURRENCY: "5"
OTEL_SDK_DISABLED: "true"
SENTRY_DSN: ${SENTRY_DSN:-}
SENTRY_ENVIRONMENT: ${SENTRY_ENVIRONMENT:-production}
SENTRY_RELEASE: ${SENTRY_RELEASE:-}
SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-0}
depends_on:
postgres:
condition: service_healthy
87 changes: 87 additions & 0 deletions docs/adr/0006-observability-sentry.md
@@ -0,0 +1,87 @@
# ADR-0006: Sentry-led error tracking now; OTel traces/logs to a backend deferred

**Status:** Accepted
**Date:** 2026-06-21

## Context

The hosted box (`api.scent.tindalabs.dev`) was **instrumented but blind**. The server is
fully wired for OpenTelemetry ([tracing.ts](../../packages/server/src/tracing.ts): NodeSDK +
auto-instrumentations + OTLP exporter) and emits `pino` logs with `trace_id`/`span_id`
correlation — but prod sets `OTEL_SDK_DISABLED=true` on both services
([deploy/docker-compose.yml](../../deploy/docker-compose.yml)), so every trace is dropped and
logs go only to `docker logs` (ephemeral, no search, no alerting). There was **no error
tracking, no alerting, no uptime monitoring**: if prod threw or the box went down, nothing
told us. With live design-partner traffic on the box, that is the gap to close first.

The OTel wiring is disabled rather than removed deliberately: standing up a managed OTLP
backend (Grafana Cloud / Honeycomb / Dash0), reconciling sampling/cost, and shipping logs
off-box is a larger project than "tell me the moment prod breaks, with a stack trace."

## Decision

**Sentry-led.** Add error tracking + alerting via `@sentry/node` now; defer turning on the
existing OTel traces/logs to a managed backend to an explicit **phase 2**.

Sentry is the fastest path to actionable production errors for a small team: SDK + DSN +
alert rule, stack traces with request/job context, issue grouping and regression detection,
deploy-aware via release tags. Because this is a PII-sensitive fingerprinting product in the
EU under BSL, the posture is **Sentry EU region + strict PII scrubbing**.

### Phase 1 (this ADR — built)

- **`@sentry/node` v10** (the current major; sets up its own OpenTelemetry under the hood —
see coexistence note below). Added to the server package.
- **[instrument.ts](../../packages/server/src/instrument.ts)** runs `Sentry.init` at module
level and **no-ops without `SENTRY_DSN`** — mirroring the `${SCENT_SECRET_KEY:-}`
"env unset = feature disabled" convention, so dev, test, and self-host stay completely
inert (every `Sentry.*` call becomes a no-op when init never ran).
- **PII scrubbing**: `sendDefaultPii: false` plus an exported, unit-tested `beforeSend`
(`scrubPii`) that strips the request body (POST `/v1/events` bodies carry raw fingerprint
signals = PII), cookies, the `x-api-key`/`cookie`/`authorization` headers, the query
string, and the client IP. Defense in depth: the explicit strip holds even if a future SDK
default changes.
- **Capture surface**: `Sentry.setupExpressErrorHandler(app)` after all routes
([app.ts](../../packages/server/src/app.ts)) for sync throws / `next(err)`; the default
global handlers for unhandled rejections; explicit `Sentry.captureException` in the
worker's BullMQ `failed` handlers (BullMQ swallows the throw into the event, so the global
handlers never see it) plus `Sentry.flush(2000)` in worker `shutdown()`.
- **Preload**: `--import ./dist/instrument.js` before `./dist/tracing.js` in the server
Dockerfile `CMD`, the worker compose `command`, and `worker:start`, so Sentry patches
before app modules load. The dev/`tsx` path gets it via a top-of-file import in
index.ts/worker.ts.
- **Errors-only by default**: `tracesSampleRate` defaults to 0 (env-overridable).
- **Uptime**: an external monitor on `/health` (Better Stack / UptimeRobot) catches a
hard-down box Sentry can't — an ops step (runbook), not code.
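
The worker part of the capture surface can be sketched as follows. This is an illustration under stated assumptions, not the code in `worker.ts`: `WorkerLike`, `ErrorReporter`, `reportJobFailures`, and `shutdown` are hypothetical names, and the interfaces are small local stand-ins for the roles that BullMQ's `Worker` and the `@sentry/node` module would play. The points it shows are the ones from the list: a `failed` listener reports explicitly, and shutdown flushes with a 2000 ms budget.

```typescript
// Hypothetical minimal shapes so the sketch is self-contained. In the real worker these
// roles would be played by BullMQ's `Worker` and the `@sentry/node` module.
interface FailedJob {
  id?: string;
  name: string;
}
type FailedListener = (job: FailedJob | undefined, err: Error) => void;
interface WorkerLike {
  on(event: 'failed', listener: FailedListener): void;
  close(): Promise<void>;
}
interface CaptureContext {
  tags?: Record<string, string>;
  extra?: Record<string, unknown>;
}
interface ErrorReporter {
  captureException(err: unknown, context?: CaptureContext): unknown;
  flush(timeoutMs?: number): Promise<boolean>;
}

// A failed job surfaces as a `failed` event rather than a thrown error, so the
// process-level handlers never see it. Report it explicitly from the listener.
function reportJobFailures(worker: WorkerLike, reporter: ErrorReporter, queueName: string): void {
  worker.on('failed', (job, err) => {
    reporter.captureException(err, {
      tags: { queue: queueName, job_name: job?.name ?? 'unknown' },
      extra: { job_id: job?.id ?? null },
    });
  });
}

// Stop taking jobs first, then give buffered events up to two seconds to be sent.
async function shutdown(worker: WorkerLike, reporter: ErrorReporter): Promise<void> {
  await worker.close();
  await reporter.flush(2000);
}
```

Passing the reporter in as a parameter is a choice made for this sketch so the listener can be exercised with a fake; the tags deliberately carry only the queue and job name, not the job payload, which may contain raw signals.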

The DSN is a write-only ingest key (not a cloud/API token of the class the operator avoids),
so it is fine to keep in `.env`.
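
To make the gating and scrubbing rules concrete, here is a minimal sketch of what an `instrument.ts` along these lines might contain. It is not the shipped module: `ScrubbableEvent`, `scrubPiiSketch`, `parseTracesSampleRate`, and `buildSentryOptions` are hypothetical names, and `ScrubbableEvent` is a local stand-in for the parts of Sentry's `ErrorEvent` that the scrubber touches. The option names (`dsn`, `environment`, `release`, `sendDefaultPii`, `tracesSampleRate`, `beforeSend`) are the real `Sentry.init` options; the assumed wiring is that the real module passes the result to `Sentry.init` only when it is defined.

```typescript
// Local stand-in for the fields of Sentry's ErrorEvent that the scrubber touches.
interface ScrubbableEvent {
  message?: string;
  request?: {
    data?: unknown;
    cookies?: Record<string, string>;
    query_string?: string;
    headers?: Record<string, string>;
  };
  user?: { id?: string; ip_address?: string };
}

const SENSITIVE_HEADERS = new Set(['x-api-key', 'cookie', 'authorization']);

// Strips the request body, cookies, query string, sensitive headers, and client IP.
function scrubPiiSketch(event: ScrubbableEvent): ScrubbableEvent {
  if (event.request) {
    delete event.request.data;
    delete event.request.cookies;
    delete event.request.query_string;
    if (event.request.headers) {
      // Header names are matched case-insensitively; benign headers are kept.
      for (const name of Object.keys(event.request.headers)) {
        if (SENSITIVE_HEADERS.has(name.toLowerCase())) delete event.request.headers[name];
      }
    }
  }
  if (event.user) delete event.user.ip_address;
  return event;
}

// Missing, blank, or non-numeric input means errors only (0); values are clamped to [0, 1].
function parseTracesSampleRate(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') return 0;
  const value = Number(raw);
  if (!Number.isFinite(value)) return 0;
  return Math.min(1, Math.max(0, value));
}

// Returns undefined when SENTRY_DSN is unset or empty, so the caller never calls Sentry.init.
function buildSentryOptions(env: Record<string, string | undefined>) {
  const dsn = env.SENTRY_DSN;
  if (!dsn) return undefined; // env unset = feature disabled
  return {
    dsn,
    environment: env.SENTRY_ENVIRONMENT || env.NODE_ENV,
    release: env.SENTRY_RELEASE || undefined,
    sendDefaultPii: false,
    tracesSampleRate: parseTracesSampleRate(env.SENTRY_TRACES_SAMPLE_RATE),
    beforeSend: scrubPiiSketch,
  };
}
```

Treating an empty string the same as an unset variable matters here because the compose file passes `${SENTRY_DSN:-}`, which reaches the process as an empty value rather than an absent one.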

### Phase 2 (deferred — NOT built here)

Turn on the existing OTel traces/logs to a managed **EU** OTLP backend: set
`OTEL_EXPORTER_OTLP_ENDPOINT`, flip `OTEL_SDK_DISABLED=false`, ship `pino` logs off-box.
Optional source-map upload for readable minified stack traces — needs a `SENTRY_AUTH_TOKEN`
(a token-class CI secret), so it stays gated/deferred. Optional `@sentry/profiling-node`.

## Key technical nuance: Sentry vs. the app's OTel

From v8 onward (including the v10 used here), Sentry's Node SDK stands up its **own**
OpenTelemetry instance. In phase 1 there is **no conflict** because the two are mutually
exclusive by config: the app's OTel is off in prod (where Sentry runs), and Sentry is off
everywhere the app's OTel is on (dev/self-host with a local collector). Phase 2 must
reconcile them: either set `skipOpenTelemetrySetup: true` on the Sentry init and register
Sentry's span processor on the app's `NodeSDK`, or let Sentry own OTel and export onward.
This is documented here so the phase-2 implementer doesn't double-initialise.

## Consequences

- Prod errors are now visible with stack traces and context, with alerting — the core
operational blind spot is closed.
- Privacy posture is explicit and auditable (EU residency + scrubbing), consistent with
[ADR-0004](0004-consent-and-data-lifecycle.md) (data lifecycle) and the BSL
"Tindalabs-hosted only" model.
- Distributed tracing / metrics / off-box logs remain deferred; the wiring already exists,
so phase 2 is a config + reconciliation task, not a rebuild.

Relates to [ADR-0003](0003-otel-bridge.md) (the OTel bridge) and
[ADR-0004](0004-consent-and-data-lifecycle.md).
1 change: 1 addition & 0 deletions docs/adr/README.md
@@ -9,3 +9,4 @@ Each ADR documents a significant architectural choice: the context, the decision
| [0003](0003-otel-bridge.md) | OTel traceparent bridge for blindspot-ux composability | Accepted |
| [0004](0004-consent-and-data-lifecycle.md) | Consent is the controller's responsibility; the SDK enforces, never triggers | Accepted |
| [0005](0005-organizations-and-tenancy.md) | Organizations are the tenant boundary; owner is org-scoped, not global | Accepted |
| [0006](0006-observability-sentry.md) | Sentry-led error tracking now; OTel traces/logs to a backend deferred | Accepted |
2 changes: 1 addition & 1 deletion packages/server/Dockerfile
@@ -31,4 +31,4 @@ ENV NODE_ENV=production
WORKDIR /app/packages/server
EXPOSE 3000

CMD ["node", "--import", "./dist/tracing.js", "dist/index.js"]
CMD ["node", "--import", "./dist/instrument.js", "--import", "./dist/tracing.js", "dist/index.js"]
3 changes: 2 additions & 1 deletion packages/server/package.json
@@ -11,7 +11,7 @@
"create-project": "tsx src/scripts/create-project.ts",
"create-admin": "tsx src/scripts/create-admin.ts",
"worker": "tsx src/worker.ts",
"worker:start": "node --import ./dist/tracing.js dist/worker.js",
"worker:start": "node --import ./dist/instrument.js --import ./dist/tracing.js dist/worker.js",
"test": "vitest run",
"test:coverage": "vitest run --coverage",
"type-check": "tsc --noEmit",
@@ -24,6 +24,7 @@
"@opentelemetry/resources": "^2.7.1",
"@opentelemetry/sdk-node": "^0.218.0",
"@opentelemetry/semantic-conventions": "^1.41.1",
"@sentry/node": "^10.59.0",
"@tindalabs/scent-engine": "workspace:*",
"bcryptjs": "^3.0.3",
"bullmq": "^5.78.0",
7 changes: 7 additions & 0 deletions packages/server/src/app.ts
@@ -1,4 +1,5 @@
import express, { type Express } from 'express';
import * as Sentry from '@sentry/node';
import cors from 'cors';
import cookieParser from 'cookie-parser';
import { rateLimitMiddleware, adminRateLimitMiddleware } from './middleware/rate-limit.js';
@@ -85,5 +86,11 @@ export function createApp(): Express {
app.use('/v1/account', requireProjectRead, accountRouter);
app.use('/v1/accounts', requireProjectRead, accountsRouter);

// Sentry error capture, after all routes. No-op until Sentry.init runs (which only
// happens with SENTRY_DSN set — see instrument.ts), so dev/test/self-host are
// unaffected. Catches sync throws and next(err); the global unhandled-rejection
// integration covers async route rejections that bubble past Express.
Sentry.setupExpressErrorHandler(app);

return app;
}
4 changes: 4 additions & 0 deletions packages/server/src/index.ts
@@ -1,3 +1,7 @@
// Imported first so Sentry.init runs before any instrumented module loads. In the
// built image this is redundant with `--import ./dist/instrument.js` (Dockerfile CMD);
// here it covers the dev/tsx path. No-ops without SENTRY_DSN.
import './instrument.js';
import { startTracing } from './tracing.js';
startTracing();

62 changes: 62 additions & 0 deletions packages/server/src/instrument.test.ts
@@ -0,0 +1,62 @@
import { describe, it, expect } from 'vitest';
import type { ErrorEvent } from '@sentry/node';
import { scrubPii } from './instrument.js';

// scrubPii is the privacy boundary for error reporting: this is a fingerprinting
// product, so a stack trace must never carry a subject's raw signals or an API key.
// Verified directly rather than trusted to integration coverage.
describe('scrubPii', () => {
it('strips the request body, cookies, and query string', () => {
const event = {
request: {
data: { fingerprint: 'raw-device-signals', email: 'user@example.com' },
cookies: { scent_admin: 'session-token' },
query_string: 'token=secret',
},
} as ErrorEvent;

const out = scrubPii(event);

expect(out.request?.data).toBeUndefined();
expect(out.request?.cookies).toBeUndefined();
expect(out.request?.query_string).toBeUndefined();
});

it('strips sensitive headers case-insensitively but keeps benign ones', () => {
const event = {
request: {
headers: {
'X-Api-Key': 'pk_live_abc',
Cookie: 'scent_admin=tok',
Authorization: 'Bearer xyz',
'Content-Type': 'application/json',
'user-agent': 'curl/8',
},
},
} as unknown as ErrorEvent;

const headers = scrubPii(event).request?.headers ?? {};

expect(headers['X-Api-Key']).toBeUndefined();
expect(headers['Cookie']).toBeUndefined();
expect(headers['Authorization']).toBeUndefined();
expect(headers['Content-Type']).toBe('application/json');
expect(headers['user-agent']).toBe('curl/8');
});

it('strips the client IP from user context', () => {
const event = {
user: { id: 'admin-1', ip_address: '203.0.113.7' },
} as unknown as ErrorEvent;

const out = scrubPii(event);

expect(out.user?.ip_address).toBeUndefined();
expect(out.user?.id).toBe('admin-1'); // non-PII identifier retained
});

it('is a no-op on an event with no request or user', () => {
const event = { message: 'boom' } as ErrorEvent;
expect(scrubPii(event)).toEqual({ message: 'boom' });
});
});