diff --git a/.claude/plans/alerts-and-notifications-improvements.md b/.claude/plans/alerts-and-notifications-improvements.md new file mode 100644 index 00000000000..4b55485707d --- /dev/null +++ b/.claude/plans/alerts-and-notifications-improvements.md @@ -0,0 +1,1105 @@ +# Alerts & Notifications Improvement Proposal + +## Motivation + +The existing alerts system only supports schema change notifications via Slack, Webhook, and MS +Teams. Users have no way to be alerted when their GraphQL API's latency spikes, error rate degrades, +or traffic volume changes unexpectedly. These are the most valuable alert types from a production +monitoring perspective. + +We also lack email as a notification channel, which is table stakes for an alerting system. + +This proposal covers two workstreams: + +1. **Metric-based alerts** — Add configurable alerts for latency, error rate, and traffic with + periodic evaluation against ClickHouse data +2. **Email alert channel** — Add EMAIL as a new notification channel type, benefiting both existing + schema-change alerts and the new metric alerts (deferred until after Phase 1 ships; design + decisions around team-member vs. group-email handling are still being finalized) + +### Design decisions made so far + +- **Resolved notifications**: Send a "resolved" notification when a metric alert transitions from + FIRING through RECOVERING back to NORMAL +- **Alert scoping**: Metric alerts can optionally be scoped to a specific insights filter (operation + IDs and/or client name+version combinations) +- **Multiple email recipients**: Email channels support an array of addresses +- **Severity levels**: Alerts carry a user-defined severity label (info, warning, critical) for + organizational purposes +- **Four-state alert lifecycle**: Alerts transition through NORMAL → PENDING → FIRING → RECOVERING → + NORMAL. 
The pending and recovering states require the condition to hold for a configurable number + of consecutive minutes (`confirmation_minutes`) before escalating or resolving, preventing + flapping (rapid normal→firing→normal→firing cycles). All four states are user-facing. + +--- + +## Pre-flight review (pre-implementation validation) + +The plan was reviewed against the current codebase on 2026-04-15. The following items were validated +and are reflected in the sections below. Anything an implementer should double-check before their +first commit is called out explicitly. + +**Validated as matching current conventions:** + +- Migration file naming (`YYYY.MM.DDTHH-MM-SS.description.ts`) and registration via `await import()` + in `run-pg-migrations.ts`. +- `MigrationExecutor` interface supports `noTransaction: true` — correct for the Phase 2 + `ALTER TYPE ... ADD VALUE` migration. +- `uuid_generate_v4()` is the default-UUID convention (not `gen_random_uuid()`). Adjusted + throughout. +- `saved_filters` table exists (migration `2026.02.07T00-00-00.saved-filters.ts`) with `id UUID`, + `project_id UUID`, `filters JSONB`. FK from `metric_alert_rules.saved_filter_id` is safe. +- `alert:modify` permission exists at `packages/services/api/src/modules/auth/lib/authz.ts` + (~line 405) and is already used by the existing `AlertsManager`. Reused for metric alert rules — + no new permission needed. +- Mutation result shape (`{ ok, error }` discriminated union) matches + `saved-filters/resolvers/Mutation/createSavedFilter.ts`. +- Module-level storage providers with `@Inject(PG_POOL_CONFIG)` are the modern pattern + (saved-filters, proposals, oidc-integrations all follow it). Legacy + `packages/services/storage/src/index.ts` is intentionally not extended. +- Graphile-Worker cron deduplication is automatic — an overrunning evaluation task will not spawn a + concurrent instance on the next tick. 
+- `send-webhook.ts` in workflows already uses `got` + retries; reusable for metric-alert webhook + delivery with zero infrastructure changes. +- Email provider abstraction (`context.email.send({ to, subject, body })`) in workflows is reusable + for Phase 2 email delivery. + +**Gaps the implementer must close during Phase 1:** + +1. **Add `@slack/web-api` to `packages/services/workflows/package.json`.** Currently only the API + package has it. Pin to the same version as the API package for parity. +2. **Cross-scope validation in mutation resolvers.** The FKs on `metric_alert_rules` cannot enforce + that all `channelIds` and the `saved_filter_id` belong to the same project as the rule's target. + Explicit validation in `addMetricAlertRule` / `updateMetricAlertRule` is required. See section + 1.3 for details. +3. **Regenerate GraphQL types.** Run `pnpm graphql:generate` from repo root after schema edits. + Generated files appear in `packages/services/api/src/__generated__/types.ts` and each module's + `resolvers.generated.ts` — do not hand-edit. +4. **Add `alertStateLogRetentionDays` to commerce plan constants.** The retention column on + `metric_alert_state_log` uses per-plan values (7d Hobby/Pro, 30d Enterprise) driven from + `packages/services/api/src/modules/commerce/constants.ts`. + +--- + +## Phase 1: Metric-Based Alerts + +### Why a new table? + +The existing `alerts` table is tightly coupled to schema-change notifications — it has a fixed +`alert_type` enum with only `SCHEMA_CHANGE_NOTIFICATIONS`, and no configuration columns. Metric +alerts need substantially different configuration: time windows, metric selectors, threshold +types/values, comparison direction, severity, evaluation state, and optional operation/client +filters. + +A new `metric_alert_rules` table keeps both systems clean and independently evolvable. It reuses the +existing `alert_channels` table (Slack, Webhook, MS Teams) for notification delivery. EMAIL support +will be added in Phase 2. 
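The four-state lifecycle from the design decisions above reduces to a small pure transition function. The sketch below is illustrative only — the function name, action names, and signature are assumptions, not the implementation; the authoritative transition table is in section 1.4.

```typescript
type AlertState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';

type Action = 'openIncident' | 'notifyFiring' | 'resolveIncident' | 'notifyResolved';

interface TransitionResult {
  next: AlertState;
  // Side effects the evaluator would perform on this tick.
  actions: Action[];
}

/**
 * One evaluation tick. `heldMinutes` is how long the rule has been in its
 * current state (now minus state_changed_at); `confirm` is confirmation_minutes.
 */
function transition(
  state: AlertState,
  breached: boolean,
  heldMinutes: number,
  confirm: number,
): TransitionResult {
  if (breached) {
    switch (state) {
      case 'NORMAL':
        return { next: 'PENDING', actions: [] };
      case 'PENDING':
        return heldMinutes >= confirm
          ? { next: 'FIRING', actions: ['openIncident', 'notifyFiring'] }
          : { next: 'PENDING', actions: [] };
      case 'FIRING':
        return { next: 'FIRING', actions: [] }; // no repeat notifications
      case 'RECOVERING':
        return { next: 'FIRING', actions: [] }; // relapse before recovery confirmed
    }
  }
  switch (state) {
    case 'NORMAL':
      return { next: 'NORMAL', actions: [] };
    case 'PENDING':
      return { next: 'NORMAL', actions: [] }; // false alarm, held < confirm
    case 'FIRING':
      return { next: 'RECOVERING', actions: [] };
    case 'RECOVERING':
      return heldMinutes >= confirm
        ? { next: 'NORMAL', actions: ['resolveIncident', 'notifyResolved'] }
        : { next: 'RECOVERING', actions: [] };
  }
  throw new Error('unreachable state');
}
```

Note that notifications only ever fire on the two confirmed edges (PENDING → FIRING and RECOVERING → NORMAL), which is what keeps a flapping metric from paging repeatedly.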
+ +### Conventions to follow (validated against codebase) + +These were verified during a final pre-implementation review against current patterns in the repo. + +- **Migration filename format**: `YYYY.MM.DDTHH-MM-SS.description.ts` (matches latest migrations + like `2026.03.25T00-00-00.access-token-expiration.ts`). +- **Migration registration**: Use the dynamic `await import('./actions/...')` pattern in + `packages/migrations/src/run-pg-migrations.ts` (the convention adopted post-Dec 2024). Static + imports exist in older entries but new migrations should use the async form. +- **UUID defaults**: Use `uuid_generate_v4()` (the existing convention) — not `gen_random_uuid()`. + The uuid extension is enabled in the base schema migration. +- **Timestamp columns**: `created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()` and an + `updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()` are present on most modern tables. Apply this to + `metric_alert_rules` (we already have `created_at`; add `updated_at`). +- **Module-level data access**: Inject `PG_POOL_CONFIG` directly via + `@Inject(PG_POOL_CONFIG) private pool: DatabasePool` in an + `@Injectable({ scope: Scope.Operation })` class. Mirror the shape of + `packages/services/api/src/modules/saved-filters/providers/saved-filters-storage.ts`. Do NOT + extend the legacy `packages/services/storage/src/index.ts`. +- **Permissions**: Reuse the existing `alert:modify` action + (`packages/services/api/src/modules/auth/lib/authz.ts` ~line 405) for both reads and writes of + metric alert rules, matching how the existing `AlertsManager` handles channels and schema-change + alerts. Do not introduce a new `alert:read` permission — that would expand authz scope + unnecessarily. +- **Mutation result shape**: Use the discriminated `{ ok: { ... } } | { error: { message } }` + pattern as shown in + `packages/services/api/src/modules/saved-filters/resolvers/Mutation/createSavedFilter.ts`. 
+- **Connection pattern**: Mirror `SavedFilterConnection` for `MetricAlertRuleIncidentConnection` and + any other paginated connections. +- **GraphQL codegen**: Run `pnpm graphql:generate` from repo root after schema edits. Generated + resolver types appear under `packages/services/api/src/__generated__/types.ts` and module + `resolvers.generated.ts` files (do NOT hand-edit those). +- **GraphQL mappers**: Add `MetricAlertRuleMapper`, `MetricAlertRuleIncidentMapper`, + `MetricAlertRuleStateChangeMapper` to + `packages/services/api/src/modules/alerts/module.graphql.mappers.ts`, pointing at internal entity + types defined in `packages/services/api/src/shared/entities.ts`. + +### 1.1 Database Migration + +**New file:** `packages/migrations/src/actions/2026.04.15T00-00-01.metric-alert-rules.ts` + +```sql +CREATE TYPE metric_alert_type AS ENUM('LATENCY', 'ERROR_RATE', 'TRAFFIC'); + +CREATE TYPE metric_alert_metric AS ENUM('avg', 'p75', 'p90', 'p95', 'p99'); + +CREATE TYPE metric_alert_threshold_type AS ENUM('FIXED_VALUE', 'PERCENTAGE_CHANGE'); + +CREATE TYPE metric_alert_direction AS ENUM('ABOVE', 'BELOW'); + +CREATE TYPE metric_alert_severity AS ENUM('INFO', 'WARNING', 'CRITICAL'); + +CREATE TYPE metric_alert_state AS ENUM('NORMAL', 'PENDING', 'FIRING', 'RECOVERING'); + +-- Alert configuration (what to monitor and how) +CREATE TABLE metric_alert_rules ( + id UUID PRIMARY KEY DEFAULT uuid_generate_v4 (), + organization_id UUID NOT NULL REFERENCES organizations (id) ON DELETE CASCADE, + project_id UUID NOT NULL REFERENCES projects (id) ON DELETE CASCADE, + target_id UUID NOT NULL REFERENCES targets (id) ON DELETE CASCADE, + type metric_alert_type NOT NULL, + time_window_minutes INT NOT NULL DEFAULT 30, + metric metric_alert_metric, -- only for LATENCY type + threshold_type metric_alert_threshold_type NOT NULL, + threshold_value DOUBLE PRECISION NOT NULL, + direction metric_alert_direction NOT NULL DEFAULT 'ABOVE', + severity metric_alert_severity NOT NULL DEFAULT 
'WARNING', + name TEXT NOT NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + enabled BOOLEAN NOT NULL DEFAULT TRUE, + last_evaluated_at TIMESTAMPTZ, + last_triggered_at TIMESTAMPTZ, -- denormalized; updated on PENDING → FIRING transitions + -- Alert state machine + state metric_alert_state NOT NULL DEFAULT 'NORMAL', + state_changed_at TIMESTAMPTZ, -- when the current state began + -- How many consecutive minutes the condition must hold before + -- PENDING → FIRING or RECOVERING → NORMAL (prevents flapping) + confirmation_minutes INT NOT NULL DEFAULT 5, + -- Optional insights filter scoping. References a saved filter by ID; the UI picks a saved + -- filter by name (see packages/services/api/src/modules/saved-filters/). If the saved filter + -- is later edited, the rule automatically uses the updated filter contents at evaluation time. + -- If the saved filter is deleted, the rule becomes unscoped (applies to the whole target). + -- + -- IMPORTANT: saved_filters are project-scoped (saved_filters.project_id), not target-scoped. + -- The addMetricAlertRule / updateMetricAlertRule resolvers MUST validate + -- `savedFilter.projectId === target.projectId` before writing, since the FK alone cannot + -- enforce this cross-column invariant. See the pattern used in + -- packages/services/api/src/modules/saved-filters/providers/saved-filters.provider.ts + -- around the `projectId !== target.projectId` check. + saved_filter_id UUID REFERENCES saved_filters (id) ON DELETE SET NULL, + CONSTRAINT metric_alert_rules_metric_required CHECK ( + ( + type = 'LATENCY' + AND metric IS NOT NULL + ) + OR ( + type != 'LATENCY' + AND metric IS NULL + ) + ) +); + +CREATE INDEX idx_metric_alert_rules_enabled ON metric_alert_rules (enabled) +WHERE + enabled = TRUE; + +-- Many-to-many: each rule can notify multiple channels; each channel can serve multiple rules. +-- Matches the UI where users add multiple destinations per alert (e.g. 
Slack #alerts + Email). +CREATE TABLE metric_alert_rule_channels ( + metric_alert_rule_id UUID NOT NULL REFERENCES metric_alert_rules (id) ON DELETE CASCADE, + alert_channel_id UUID NOT NULL REFERENCES alert_channels (id) ON DELETE CASCADE, + PRIMARY KEY (metric_alert_rule_id, alert_channel_id) +); + +CREATE INDEX idx_marc_channel ON metric_alert_rule_channels (alert_channel_id); + +-- Alert incident history (each time an alert fires and resolves) +CREATE TABLE metric_alert_incidents ( + id UUID PRIMARY KEY DEFAULT uuid_generate_v4 (), + metric_alert_rule_id UUID NOT NULL REFERENCES metric_alert_rules (id) ON DELETE CASCADE, + started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + resolved_at TIMESTAMPTZ, -- NULL while still firing + current_value DOUBLE PRECISION NOT NULL, + previous_value DOUBLE PRECISION, + threshold_value DOUBLE PRECISION NOT NULL -- snapshot of threshold at time of incident +); + +CREATE INDEX idx_metric_alert_incidents_alert ON metric_alert_incidents (metric_alert_rule_id); + +CREATE INDEX idx_metric_alert_incidents_open ON metric_alert_incidents (metric_alert_rule_id) +WHERE + resolved_at IS NULL; + +-- State transition log (powers alert history UI: event list, bar chart, state timeline) +-- Retention: 7 days for Hobby/Pro, 30 days for Enterprise (set at insert time via expires_at) +CREATE TABLE metric_alert_state_log ( + id UUID PRIMARY KEY DEFAULT uuid_generate_v4 (), + metric_alert_rule_id UUID NOT NULL REFERENCES metric_alert_rules (id) ON DELETE CASCADE, + target_id UUID NOT NULL REFERENCES targets (id) ON DELETE CASCADE, + from_state metric_alert_state NOT NULL, + to_state metric_alert_state NOT NULL, + -- Value snapshot at the time of transition (for historical accuracy even if rule changes) + VALUE DOUBLE PRECISION, -- current window's metric value + previous_value DOUBLE PRECISION, -- previous window's metric value + threshold_value DOUBLE PRECISION, -- snapshot of rule.threshold_value at transition time + created_at TIMESTAMPTZ NOT NULL 
DEFAULT NOW(), + expires_at TIMESTAMPTZ NOT NULL -- set based on org plan at insert time +); + +CREATE INDEX idx_metric_alert_state_log_rule ON metric_alert_state_log (metric_alert_rule_id, created_at); + +CREATE INDEX idx_metric_alert_state_log_target ON metric_alert_state_log (target_id, created_at); + +CREATE INDEX idx_metric_alert_state_log_expires ON metric_alert_state_log (expires_at); +``` + +### 1.2 Data Access + +**New file:** `packages/services/api/src/modules/alerts/providers/metric-alert-rules-storage.ts` + +Module-level provider with direct `PG_POOL_CONFIG` injection (the modern pattern used by recent +modules like app-deployments, schema-proposals, saved-filters — not the legacy storage service). +Provides: + +Alert configuration CRUD: + +- `addMetricAlertRule(input)`, `updateMetricAlertRule(id, fields)`, `deleteMetricAlertRules(ids)` +- `getMetricAlertRules(projectId)`, `getMetricAlertRulesByTarget(targetId)` +- `getAllEnabledMetricAlertRules()` — used by the workflows evaluation task; returns rules joined + with their channels (via `metric_alert_rule_channels`), saved filter contents (via + `saved_filters.id`), target, project, organization + +Channel management (multi-destination): + +- `setRuleChannels(ruleId, channelIds[])` — replaces the full set of channels attached to a rule + (called from add/update mutations) +- `getRuleChannels(ruleId)` — all channels attached to a rule + +Incident management: + +- `createIncident(ruleId, currentValue, previousValue, thresholdValue)` — called when rule + transitions PENDING → FIRING +- `resolveIncident(ruleId)` — sets `resolved_at` on the open incident when rule transitions + RECOVERING → NORMAL +- `getOpenIncident(ruleId)` — find currently firing incident (where `resolved_at IS NULL`) +- `getIncidentHistory(ruleId, limit, offset)` — paginated history for the UI + +State log (powers alert history UI): + +- `logStateTransition(ruleId, targetId, fromState, toState, value, previousValue, thresholdValue, 
expiresAt)` + — called on every state transition during evaluation. `thresholdValue` is snapshotted from the + rule at transition time so historical events remain accurate even if the rule is later edited. + `expiresAt` is computed from the org's plan: `NOW() + 7 days` for Hobby/Pro, `NOW() + 30 days` for + Enterprise. +- `getStateLog(ruleId, from, to)` — state changes for a single alert rule within a time range +- `getStateLogByTarget(targetId, from, to)` — all state changes across all rules for a target + +### 1.3 GraphQL API + +**File:** `packages/services/api/src/modules/alerts/module.graphql.ts` + +```graphql +enum MetricAlertRuleType { + LATENCY + ERROR_RATE + TRAFFIC +} +enum MetricAlertRuleMetric { + avg + p75 + p90 + p95 + p99 +} +enum MetricAlertRuleThresholdType { + FIXED_VALUE + PERCENTAGE_CHANGE +} +enum MetricAlertRuleDirection { + ABOVE + BELOW +} +enum MetricAlertRuleSeverity { + INFO + WARNING + CRITICAL +} +enum MetricAlertRuleState { + NORMAL + PENDING + FIRING + RECOVERING +} + +type MetricAlertRule { + id: ID! + name: String! + type: MetricAlertRuleType! + target: Target! + """ + Destinations that receive notifications when this rule fires or resolves. + Backed by the metric_alert_rule_channels join table. + """ + channels: [AlertChannel!]! + timeWindowMinutes: Int! + metric: MetricAlertRuleMetric + thresholdType: MetricAlertRuleThresholdType! + thresholdValue: Float! + direction: MetricAlertRuleDirection! + severity: MetricAlertRuleSeverity! + state: MetricAlertRuleState! + confirmationMinutes: Int! + enabled: Boolean! + lastEvaluatedAt: DateTime + """ + Most recent time this rule transitioned PENDING → FIRING (null if never fired) + """ + lastTriggeredAt: DateTime + createdAt: DateTime! + """ + The saved filter that scopes this rule (null = applies to the whole target). + """ + savedFilter: SavedFilter + """ + Count of state transitions logged for this rule in the given time range. + Powers the "Events" column on the alert rules list. 
+ """ + eventCount(from: DateTime!, to: DateTime!): Int! + """ + The currently open incident, if any + """ + currentIncident: MetricAlertRuleIncident + """ + Past incidents for this alert + """ + incidents(first: Int, after: String): MetricAlertRuleIncidentConnection! +} + +type MetricAlertRuleIncident { + id: ID! + startedAt: DateTime! + resolvedAt: DateTime + currentValue: Float! + previousValue: Float + thresholdValue: Float! +} + +type MetricAlertRuleStateChange { + id: ID! + fromState: MetricAlertRuleState! + toState: MetricAlertRuleState! + """ + Metric value in the current window at transition time + """ + value: Float + """ + Metric value in the previous (comparison) window at transition time + """ + previousValue: Float + """ + Threshold value snapshotted at transition time (survives rule edits) + """ + thresholdValue: Float + createdAt: DateTime! + rule: MetricAlertRule! +} + +extend type MetricAlertRule { + """ + State change history for this rule (powers the state timeline) + """ + stateLog(from: DateTime!, to: DateTime!): [MetricAlertRuleStateChange!]! +} + +extend type Target { + """ + State changes across all alert rules for this target (powers the alert events chart + list) + """ + metricAlertRuleStateLog(from: DateTime!, to: DateTime!): [MetricAlertRuleStateChange!]! +} + +extend type Project { + metricAlertRules: [MetricAlertRule!] +} + +# Mutations (standard ok/error result pattern): +# - addMetricAlertRule(input: AddMetricAlertRuleInput!) +# - updateMetricAlertRule(input: UpdateMetricAlertRuleInput!) +# - deleteMetricAlertRules(input: DeleteMetricAlertRulesInput!) +# +# AddMetricAlertRuleInput / UpdateMetricAlertRuleInput include: +# channelIds: [ID!]! 
# attach one or more channels (multi-destination) +# savedFilterId: ID # optional FK to a saved filter +# confirmationMinutes: Int # defaults to 5; UI defaults to 0 for long windows +``` + +**New resolver files:** `MetricAlertRule.ts`, `MetricAlertRuleIncident.ts`, +`MetricAlertRuleIncidentConnection.ts`, `MetricAlertRuleStateChange.ts`, +`Mutation/addMetricAlertRule.ts`, `Mutation/updateMetricAlertRule.ts`, +`Mutation/deleteMetricAlertRules.ts`. + +Uses `alert:modify` permission for both reads and writes (no separate `alert:read` exists; this +matches how the existing `AlertsManager` handles schema-change alerts). + +**Cross-scope validation in mutations**: The `addMetricAlertRule` / `updateMetricAlertRule` +resolvers must validate — _before_ writing — that: + +1. All provided `channelIds` belong to the same project as `targetId`. Each `alert_channels` row has + a `project_id` column; join and reject with an error if any channel's project doesn't match. +2. If `savedFilterId` is provided, `savedFilter.projectId === target.projectId`. The FK to + `saved_filters(id)` cannot enforce this cross-column invariant on its own. Follow the validation + pattern already in use at + `packages/services/api/src/modules/saved-filters/providers/saved-filters.provider.ts`. + +Both failures should surface as structured errors via the `{ error: { message } }` result branch, +not thrown exceptions. + +### 1.4 Evaluation Engine (Workflows Service) + +#### Why the workflows service? + +The workflows service (`packages/services/workflows/`) is our existing background job runner, built +on Graphile-Worker with PostgreSQL-backed task queues. It already runs periodic cron jobs (cleanup +tasks, etc.) and one-off tasks scheduled by the API (emails, webhooks). Metric alert evaluation — a +periodic job that queries data and sends notifications — is exactly what this service was built for. + +#### Add ClickHouse to Workflows + +The workflows service currently only has PostgreSQL access. 
We need to add a lightweight ClickHouse +HTTP client so it can query operations metrics. + +- `packages/services/workflows/src/environment.ts` — add `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, + `CLICKHOUSE_USERNAME`, `CLICKHOUSE_PASSWORD` +- `packages/services/workflows/src/context.ts` — add `clickhouse` to Context +- `packages/services/workflows/src/lib/clickhouse-client.ts` — **new**, simple HTTP client built on + `got` (matching the pattern used by + `packages/services/workflows/src/lib/webhooks/send-webhook.ts`). The API's ClickHouse client is + DI-heavy and ships with `agentkeepalive`; we just need raw query execution with retry on 5xx. +- `packages/services/workflows/src/index.ts` — instantiate and inject + +#### Evaluation Task + +**New file:** `packages/services/workflows/src/tasks/evaluate-metric-alert-rules.ts` + +Cron: + +- `* * * * * evaluateMetricAlertRules` (every minute) +- `0 4 * * * purgeExpiredAlertStateLog` (daily at 4:00 AM — deletes rows where `expires_at < NOW()`) + +The ingestion pipeline (API buffer → Kafka → ClickHouse async insert) has a worst-case latency of +~31 seconds, so by the time each 1-minute cron tick fires, the previous minute's data is reliably +available. This gives on-call engineers a worst-case ~2 minute detection time. + +1. Fetch all enabled metric alert rules from PostgreSQL. Single query joins with + `metric_alert_rule_channels` → `alert_channels` (returning an array of channels per rule), + `saved_filters` (via `saved_filter_id`, for filter contents), `targets`, `projects`, + `organizations`. +2. Group by `(target_id, time_window_minutes, saved_filter_id)` to batch ClickHouse queries. Rules + pointing at the same saved filter share a query. +3. Query ClickHouse for current window and previous window (with 1-minute offset to account for + ingestion pipeline latency) +4. Compare metric values against thresholds +5. 
State machine transitions (using `state` and `state_changed_at` on the alert row): + + **When threshold IS breached:** + + - **NORMAL → PENDING**: set state to PENDING, set `state_changed_at` to now. No notification yet. + - **PENDING (held < confirmation_minutes)**: remain PENDING, no action. + - **PENDING (held >= confirmation_minutes) → FIRING**: create incident row, send alert + notification to **all** channels attached via `metric_alert_rule_channels`, update state + + `state_changed_at`, set `last_triggered_at = NOW()`. + - **FIRING → FIRING**: no notification (prevent spam), update `last_evaluated_at`. + - **RECOVERING → FIRING**: condition failed again before recovery confirmed. Set state back to + FIRING, update `state_changed_at`. No notification (already sent). + + **When threshold is NOT breached:** + + - **NORMAL → NORMAL**: update `last_evaluated_at` only. + - **PENDING → NORMAL**: false alarm — condition didn't hold long enough. Reset state to NORMAL, + update `state_changed_at`. No notification. + - **FIRING → RECOVERING**: set state to RECOVERING, set `state_changed_at` to now. No + notification yet. + - **RECOVERING (held < confirmation_minutes)**: remain RECOVERING, no action. + - **RECOVERING (held >= confirmation_minutes) → NORMAL**: set `resolved_at` on open incident, + send "resolved" notification to **all** channels attached via `metric_alert_rule_channels`, + update state + `state_changed_at`. + +#### ClickHouse Query Design + +**Key optimization**: Fetch both windows and all metrics in a **single query** per +`(target, filter)` group. This serves latency, error rate, and traffic alerts simultaneously and +halves round-trips by returning both the current and previous window in one result. + +The query uses **explicit sliding windows** rather than `toStartOfInterval` bucketing. 
Using +`toStartOfInterval` would snap to fixed time boundaries, producing partial buckets at the edges (and +potentially 3 rows instead of 2), leading to incorrect metric calculations. Instead, we define two +exact non-overlapping ranges and use a `CASE` expression to label each row: + +``` +now = current time +offset = 1 minute (ingestion pipeline latency buffer) +W = windowMinutes + +currentWindow: [now - offset - W, now - offset) +previousWindow: [now - offset - 2W, now - offset - W) +``` + +```sql +SELECT + CASE + WHEN timestamp >= {currentWindowStart} THEN 'current' + ELSE 'previous' + END as window, + sum(total) as total, + sum(total_ok) as total_ok, + avgMerge(duration_avg) as average, + quantilesMerge(0.75, 0.90, 0.95, 0.99)(duration_quantiles) as percentiles +FROM operations_minutely +WHERE target = {targetId} + AND timestamp >= {previousWindowStart} + AND timestamp < {currentWindowEnd} + [AND hash IN/NOT IN ({operationIds})] + [AND (client_name, client_version) IN/NOT IN ({clientFilters})] +GROUP BY window +ORDER BY window +``` + +Where the boundaries are computed as: + +- `currentWindowEnd = now - 1 minute` +- `currentWindowStart = now - 1 minute - W` +- `previousWindowStart = now - 1 minute - 2W` + +This always returns exactly 2 rows (one per label), each aggregating a complete window with no +partial-bucket artifacts. + +From these: + +- **Latency**: pick the relevant percentile or average from each row +- **Error rate**: `(total - total_ok) / total * 100` per row +- **Traffic**: `total` per row + +#### Threshold Comparison + +- **FIXED_VALUE**: `currentValue > thresholdValue` (or `<` for BELOW direction) +- **PERCENTAGE_CHANGE**: `((currentValue - previousValue) / previousValue) * 100 > thresholdValue` + +**Edge cases:** + +- **Both windows have 0 data**: Skip evaluation entirely (no meaningful comparison possible). +- **Previous window is 0, current is > 0 (PERCENTAGE_CHANGE)**: Division by zero. 
Fall back to + FIXED_VALUE comparison against the threshold — i.e., check `currentValue > thresholdValue` + directly. This avoids a runtime error while still alerting on a meaningful spike from zero + baseline. +- **Previous window is 0, current is 0 (PERCENTAGE_CHANGE)**: No change, treat as OK. +- **Current window has data but previous doesn't exist** (e.g., alert was just created): Skip + evaluation until both windows have data. + +### 1.5 Notifications from Workflows + +**New file:** `packages/services/workflows/src/lib/metric-alert-notifier.ts` + +Sends notifications directly from the workflows service (the API's DI container is not available +here): + +- **Webhooks**: Reuse `packages/services/workflows/src/lib/webhooks/send-webhook.ts` directly. It + already handles the `RequestBroker` path, retries via `args.helpers.job.attempts`, and uses `got`. + No changes needed to webhook infrastructure. +- **Slack**: Instantiate `new WebClient(token)` from `@slack/web-api` and call + `client.chat.postMessage(...)`. The token is fetched from PostgreSQL: + `SELECT slack_token FROM organizations WHERE id = $1`. Mirror the message formatting from + `packages/services/api/src/modules/alerts/providers/adapters/slack.ts` (color-coded + `MessageAttachment[]`, mrkdwn, severity badges). + + > **Action required**: Add `"@slack/web-api": "7.10.0"` to + > `packages/services/workflows/package.json` dependencies. It's currently only installed in the + > API package. Keep the version aligned with `packages/services/api/package.json`. + +- **Teams**: Raw `got.post()` to the channel's `webhook_endpoint` with a MessageCard JSON body. + Mirror the payload shape from the API's `TeamsCommunicationAdapter` (truncated to 27KB). +- **Email**: Use `context.email.send()` (added in Phase 2). 
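A minimal sketch of the shared summary-line formatter that the channel adapters above could wrap in their own envelopes (Slack `MessageAttachment`, Teams MessageCard, webhook JSON). All names here are assumptions; the real Slack formatting should mirror the API's existing adapter, as noted above.

```typescript
type Direction = 'ABOVE' | 'BELOW';

interface AlertEvent {
  state: 'firing' | 'resolved';
  ruleName: string;
  metricLabel: string; // e.g. 'p99 latency'
  unit: string; // e.g. 'ms'
  currentValue: number;
  previousValue: number | null;
  thresholdValue: number;
  direction: Direction;
}

// Builds the one-line human-readable summary shared by all channels.
function formatSummary(e: AlertEvent): string {
  const value = `${e.currentValue}${e.unit}`;
  const threshold = `${e.direction.toLowerCase()} ${e.thresholdValue}${e.unit}`;
  if (e.state === 'resolved') {
    return `:white_check_mark: Resolved: "${e.ruleName}" (${e.metricLabel} is now ${value}, threshold: ${threshold})`;
  }
  // Skip the percentage when the previous window is missing or zero
  // (the division-by-zero edge case from section 1.4).
  const change =
    e.previousValue != null && e.previousValue !== 0
      ? ` (was ${e.previousValue}${e.unit}, ${(((e.currentValue - e.previousValue) / e.previousValue) * 100).toFixed(0)}% change)`
      : '';
  return `:rotating_light: "${e.ruleName}": ${e.metricLabel} is ${value}${change}, threshold: ${threshold}`;
}
```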
+ +Example messages: + +Firing (Slack): + +> :rotating_light: **Latency Alert: "API p99 Spike"** — Target: `my-target` in `my-project` p99 +> latency is **450ms** (was 200ms, +125%) — Threshold: above 200ms + +Resolved (Slack): + +> :white_check_mark: **Resolved: "API p99 Spike"** — p99 latency is now **180ms** (threshold: 200ms) + +Webhook payload: + +```json +{ + "type": "metric_alert", + "state": "firing", + "alert": { "name": "...", "type": "LATENCY", "metric": "p99", "severity": "warning" }, + "currentValue": 450, + "previousValue": 200, + "changePercent": 125, + "threshold": { "type": "FIXED_VALUE", "value": 200, "direction": "ABOVE" }, + "filter": { + "savedFilterId": "...", + "name": "My Filter", + "contents": { "operationIds": ["abc123"] } + }, + "target": { "slug": "..." }, + "project": { "slug": "..." }, + "organization": { "slug": "..." } +} +``` + +### 1.6 Alert State Log Retention + +Alert state log retention is plan-gated, following the same pattern as operations data retention +(see `packages/services/api/src/modules/commerce/constants.ts`): + +| Plan | State log retention | +| ---------- | ------------------- | +| Hobby | 7 days | +| Pro | 7 days | +| Enterprise | 30 days | + +When logging a state transition, the evaluation task looks up the organization's plan and sets +`expires_at` accordingly. A daily cron task (`purgeExpiredAlertStateLog`) deletes expired rows. + +**File:** `packages/services/api/src/modules/commerce/constants.ts` — add +`alertStateLogRetentionDays` to each plan's limits. + +### 1.7 Deployment + +**File:** `deployment/services/workflows.ts` — add ClickHouse env vars to the workflows service +deployment config. 
+ +### Key files for Phase 1 + +| File | Change | +| ---------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- | +| `packages/migrations/src/actions/2026.04.15T00-00-01.metric-alert-rules.ts` | **New** — migration (rules, incidents, state log, **rule_channels join table**) | +| `packages/services/api/src/modules/alerts/providers/metric-alert-rules-storage.ts` | **New** — module-level CRUD provider (incl. setRuleChannels) | +| `packages/services/api/src/modules/alerts/module.graphql.ts` | Add MetricAlertRule types/mutations (multi-channel + savedFilter FK + eventCount + lastTriggeredAt) | +| `packages/services/api/src/modules/alerts/resolvers/` | New resolver files | +| `packages/services/api/src/modules/alerts/providers/alerts-manager.ts` | Add metric alert methods | +| `packages/services/workflows/package.json` | Add `@slack/web-api` dependency | +| `packages/services/workflows/src/environment.ts` | Add ClickHouse env vars | +| `packages/services/workflows/src/context.ts` | Add clickhouse to Context | +| `packages/services/workflows/src/lib/clickhouse-client.ts` | **New** — lightweight CH client | +| `packages/services/workflows/src/tasks/evaluate-metric-alert-rules.ts` | **New** — evaluation cron task | +| `packages/services/workflows/src/lib/metric-alert-notifier.ts` | **New** — notification sender | +| `packages/services/workflows/src/tasks/purge-expired-alert-state-log.ts` | **New** — daily purge task | +| `packages/services/workflows/src/index.ts` | Register tasks + crontab | +| `packages/services/api/src/modules/commerce/constants.ts` | Add alertStateLogRetentionDays | +| `deployment/services/workflows.ts` | ClickHouse env vars | + +--- + +## Phase 2: Email Alert Channel + +> **Status:** Deferred. This phase ships after Phase 1. Open design questions remain around how +> recipients are represented (free-form email strings vs. 
references to team members vs. a mix), and
> we want to gather feedback from the Phase 1 rollout before committing to a schema.

Follows the exact same pattern used when MS Teams was added
(`2024.06.11T10-10-00.ms-teams-webhook.ts`).

### 2.1 Database Migration

**New file:** `packages/migrations/src/actions/2026.04.15T00-00-00.email-alert-channel.ts`

```sql
ALTER TYPE alert_channel_type
ADD VALUE 'EMAIL';

-- Array of email addresses to support multiple recipients per channel
ALTER TABLE alert_channels
ADD COLUMN email_addresses TEXT[];
```

**Important:** This migration must use `noTransaction: true` because PostgreSQL does not allow
`ALTER TYPE ... ADD VALUE` inside a transaction block. The migration runner (`pg-migrator.ts`) wraps
migrations in transactions by default — the `noTransaction` flag opts out. There are 18+ existing
migrations using this pattern (e.g., index creation with `CREATE INDEX CONCURRENTLY`). Note: the
existing MS Teams migration (`2024.06.11T10-10-00.ms-teams-webhook.ts`) is missing this flag, which
is a latent bug.

```typescript
import { type MigrationExecutor } from '../pg-migrator';

export default {
  name: '2026.04.15T00-00-00.email-alert-channel.ts',
  noTransaction: true,
  run: ({ sql }) => [
    {
      name: 'Add EMAIL to alert_channel_type',
      query: sql`ALTER TYPE alert_channel_type ADD VALUE 'EMAIL'`,
    },
    {
      name: 'Add email_addresses column',
      query: sql`ALTER TABLE alert_channels ADD COLUMN email_addresses TEXT[]`,
    },
  ],
} satisfies MigrationExecutor;
```

Register in `packages/migrations/src/run-pg-migrations.ts`.

### 2.2 Data Access

The legacy storage service (`packages/services/storage/src/index.ts`) will not be extended. Instead,
following the modern pattern used by recent modules (app-deployments, schema-proposals,
saved-filters), we create a module-level provider that injects `PG_POOL_CONFIG` directly.
+ +**New file:** `packages/services/api/src/modules/alerts/providers/alert-channels-storage.ts` + +```typescript +@Injectable({ scope: Scope.Operation }) +export class AlertChannelsStorage { + constructor(@Inject(PG_POOL_CONFIG) private pool: DatabasePool) {} + + async addAlertChannel(input: { ... emailAddresses?: string[] | null }) { ... } + async getAlertChannels(projectId: string) { ... } + async deleteAlertChannels(projectId: string, channelIds: string[]) { ... } +} +``` + +This provider takes over alert channel CRUD from the legacy storage module. Existing callers in +`AlertsManager` are updated to use this new provider instead. + +**File:** `packages/services/api/src/shared/entities.ts` — add `emailAddresses: string[] | null` to +`AlertChannel` interface. + +### 2.3 GraphQL API + +**File:** `packages/services/api/src/modules/alerts/module.graphql.ts` + +```graphql +# Add EMAIL to existing enum +enum AlertChannelType { SLACK, WEBHOOK, MSTEAMS_WEBHOOK, EMAIL } + +# New implementing type +type AlertEmailChannel implements AlertChannel { + id: ID! + name: String! + type: AlertChannelType! + emails: [String!]! +} + +# New input — accepts multiple recipients +input EmailChannelInput { + emails: [String!]! +} + +# Update AddAlertChannelInput to include email field +input AddAlertChannelInput { + ...existing fields... + email: EmailChannelInput +} +``` + +### 2.4 Resolvers + +**New file:** `packages/services/api/src/modules/alerts/resolvers/AlertEmailChannel.ts` + +```typescript +export const AlertEmailChannel: AlertEmailChannelResolvers = { + __isTypeOf: channel => channel.type === 'EMAIL', + emails: channel => channel.emailAddresses ?? 
[] +} +``` + +**File:** `packages/services/api/src/modules/alerts/resolvers/Mutation/addAlertChannel.ts` + +- Add Zod validation: + `email: MaybeModel(z.object({ emails: z.array(z.string().email().max(255)).min(1).max(10) }))` +- Pass `emailAddresses: input.email?.emails` to AlertsManager + +### 2.5 AlertsManager + +**File:** `packages/services/api/src/modules/alerts/providers/alerts-manager.ts` + +- Update `addChannel()` input to accept `emailAddresses?: string[] | null` +- Pass through to storage +- Update `triggerChannelConfirmation()` to handle EMAIL type +- Update `triggerSchemaChangeNotifications()` to dispatch via email adapter + +### 2.6 Email Communication Adapter + +**New file:** `packages/services/api/src/modules/alerts/providers/adapters/email.ts` + +Implements `CommunicationAdapter` interface. Uses `TaskScheduler` to schedule an email task in the +workflows service (same pattern as `WebhookCommunicationAdapter` which schedules +`SchemaChangeNotificationTask`). + +```typescript +@Injectable() +export class EmailCommunicationAdapter implements CommunicationAdapter { + constructor( + private taskScheduler: TaskScheduler, + private logger: Logger + ) {} + + async sendSchemaChangeNotification(input: SchemaChangeNotificationInput) { + await this.taskScheduler.scheduleTask(AlertEmailNotificationTask, { + recipients: input.channel.emailAddresses ?? [], + event: { + /* schema change details */ + } + }) + } + + async sendChannelConfirmation(input: ChannelConfirmationInput) { + await this.taskScheduler.scheduleTask(AlertEmailConfirmationTask, { + recipients: input.channel.emailAddresses ?? 
[], + event: input.event + }) + } +} +``` + +### 2.7 Email Task + Template (Workflows Service) + +**New file:** `packages/services/workflows/src/tasks/alert-email-notification.ts` + +- Define `AlertEmailNotificationTask` and `AlertEmailConfirmationTask` +- Send emails using `context.email.send()` with MJML templates (same pattern as + `email-verification.ts`) + +**New file:** `packages/services/workflows/src/lib/emails/templates/alert-notification.ts` + +- MJML email template for schema change notifications (use existing `email()`, `paragraph()`, + `button()` helpers from `components.ts`) + +**File:** `packages/services/workflows/src/index.ts` — register new task module + +### 2.8 Module Registration + +**File:** `packages/services/api/src/modules/alerts/index.ts` — add `EmailCommunicationAdapter` to +providers + +### 2.9 Frontend + +**File:** `packages/web/app/src/components/project/alerts/create-channel.tsx` + +- Add email addresses field with Zod validation (the existing form uses Yup + Formik, but new form + code should use Zod + react-hook-form to match the current codebase convention) +- Show email input when type === EMAIL +- Pass `email: { emails: values.emailAddresses }` in mutation + +**File:** `packages/web/app/src/components/project/alerts/channels-table.tsx` + +- Handle `AlertEmailChannel` typename to display the email addresses + +### Key files for Phase 2 + +| File | Change | +| -------------------------------------------------------------------------------- | -------------------------------------------------------------- | +| `packages/migrations/src/actions/2026.04.15T00-00-00.email-alert-channel.ts` | **New** — migration | +| `packages/migrations/src/run-pg-migrations.ts` | Register migration | +| `packages/services/api/src/modules/alerts/providers/alert-channels-storage.ts` | **New** — module-level CRUD provider (replaces legacy storage) | +| `packages/services/api/src/shared/entities.ts` | Add emailAddresses to AlertChannel | +| 
`packages/services/api/src/modules/alerts/module.graphql.ts` | Add AlertEmailChannel type + EmailChannelInput | +| `packages/services/api/src/modules/alerts/resolvers/AlertEmailChannel.ts` | **New** — resolver | +| `packages/services/api/src/modules/alerts/resolvers/Mutation/addAlertChannel.ts` | Add email validation + input | +| `packages/services/api/src/modules/alerts/providers/alerts-manager.ts` | Handle EMAIL in dispatch | +| `packages/services/api/src/modules/alerts/providers/adapters/email.ts` | **New** — adapter | +| `packages/services/api/src/modules/alerts/index.ts` | Register adapter | +| `packages/services/workflows/src/tasks/alert-email-notification.ts` | **New** — email task | +| `packages/services/workflows/src/lib/emails/templates/alert-notification.ts` | **New** — MJML template | +| `packages/services/workflows/src/index.ts` | Register task | +| `packages/web/app/src/components/project/alerts/create-channel.tsx` | Add email form field | +| `packages/web/app/src/components/project/alerts/channels-table.tsx` | Display email channels | + +--- + +## Open Question: Time Window Sizes and ClickHouse Data Retention + +The UI mockup shows "every 7d" as a time window option. Supporting larger windows is valuable — for +example, a user might set a weekly traffic alert to track projected monthly usage. However, larger +windows interact with ClickHouse data retention in ways worth discussing. + +### How alert evaluation works with ClickHouse + +Each alert evaluation compares two time windows: + +- **Current window**: the most recent N minutes of data +- **Previous window**: the N minutes before that (used as the baseline for comparison) + +This means the query looks back **2x the window size**. A 7-day alert looks 14 days back. 
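The two-window computation can be sketched as follows. This is an illustrative helper, not the actual evaluation-task code:

```typescript
interface EvaluationWindows {
  currentFrom: Date;
  currentTo: Date;
  previousFrom: Date;
  previousTo: Date;
}

// The previous window ends exactly where the current one begins, so the
// total lookback is 2x the configured window size.
function evaluationWindows(windowMinutes: number, now: Date = new Date()): EvaluationWindows {
  const windowMs = windowMinutes * 60_000;
  const currentFrom = new Date(now.getTime() - windowMs);
  return {
    currentFrom,
    currentTo: now,
    previousFrom: new Date(now.getTime() - 2 * windowMs),
    previousTo: currentFrom,
  };
}
```

Both windows can then be fetched in a single ClickHouse query by labeling rows with a `CASE` on the timestamp, as described under Performance & Scaling.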

### ClickHouse materialized view retention

Our ClickHouse tables have different TTLs and granularities:

| Table                 | Granularity      | TTL      | Max alert window (2x lookback) |
| --------------------- | ---------------- | -------- | ------------------------------ |
| `operations_minutely` | 1-minute buckets | 24 hours | ~6 hours                       |
| `operations_hourly`   | 1-hour buckets   | 30 days  | ~14 days                       |
| `operations_daily`    | 1-day buckets    | 1 year   | ~6 months                      |

The evaluation engine automatically selects the appropriate table based on window size.

### Tradeoffs with larger windows

**Granularity vs. sensitivity**: Larger windows require coarser-grained tables. A 7-day alert uses
the hourly table, meaning data is aggregated in 1-hour buckets. A brief 10-minute latency spike
would be smoothed into an hourly average and might not trigger the alert. For use cases like weekly
traffic totals this is fine, but for spike detection shorter windows are more appropriate.

**Evaluation frequency vs. window size**: The cron job runs every minute (see the Performance &
Scaling section below). For a 7-day window, the result shifts by 1 minute out of 10,080 —
consecutive evaluations produce nearly identical values. This is harmless but slightly wasteful. A
future optimization could scale evaluation frequency with window size (e.g., hourly evaluation for
daily/weekly alerts).

**ClickHouse query cost**: The single-query optimization (fetching both windows in one
`CASE`-labeled query) works regardless of window size — it always returns 2 rows. Query cost scales
with the number of distinct `(target, filter)` groups, not with window length. At ~100 groups,
that's ~100 queries per minute, which is modest given ClickHouse's primary key efficiency (`target`
is the first key, timestamp is in the sort order, and daily partitions auto-prune irrelevant data).

### Recommendation

Support window sizes from **5 minutes up to 14 days** (20,160 minutes).
This covers: + +- Short-term spike detection (5m–1h on minutely table) +- Medium-term trend monitoring (1h–6h on minutely table) +- Daily/weekly usage tracking (1d–14d on hourly table) + +The 14-day cap ensures the comparison window (28 days back) fits comfortably within the hourly +table's 30-day TTL. If there's demand for 30-day windows in the future, those would fall to the +daily table (1-year TTL) and could be added later. + +**Should we support a different range, or is 5 minutes to 14 days sufficient for V1?** + +--- + +## Open Question: Specialized ClickHouse Materialized View + +The existing `operations_minutely` table stores one row per +`(target, hash, client_name, client_version, minute)`: + +``` +ORDER BY (target, hash, client_name, client_version, timestamp) +``` + +For a target-wide alert (no operation/client filters), the evaluation query must aggregate across +all operation hashes and client combinations. A target with 500 operations and 10 client versions +produces ~5,000 rows per minute. For a 30-minute window comparing current vs. previous, the query +scans and merges **~300,000 rows** of `AggregateFunction` state. + +### Would a target-level MV help? 

A specialized materialized view pre-aggregated at the target level:

```sql
CREATE MATERIALIZED VIEW default.operations_target_minutely (
  target LowCardinality(String) CODEC(ZSTD(1)),
  timestamp DateTime('UTC') CODEC(DoubleDelta, LZ4),
  total UInt32 CODEC(T64, ZSTD(1)),
  total_ok UInt32 CODEC(T64, ZSTD(1)),
  duration_avg AggregateFunction(avg, UInt64) CODEC(ZSTD(1)),
  duration_quantiles AggregateFunction(quantiles(0.75, 0.9, 0.95, 0.99), UInt64) CODEC(ZSTD(1))
)
ENGINE = SummingMergeTree
PRIMARY KEY (target)
ORDER BY (target, timestamp)
TTL timestamp + INTERVAL 24 HOUR
AS
SELECT
  target,
  toStartOfMinute(timestamp) AS timestamp,
  count() AS total,
  sum(ok) AS total_ok,
  avgState(duration) AS duration_avg,
  quantilesState(0.75, 0.9, 0.95, 0.99)(duration) AS duration_quantiles
FROM default.operations
GROUP BY target, timestamp
```

This collapses the 5,000 rows/minute down to **1 row per (target, minute)**. The same 30-minute
alert query would scan ~60 rows instead of ~300,000.

### Tradeoffs

**Benefits:**

- Dramatically fewer rows to scan for target-wide alerts (the common case)
- Simpler, faster merges — no hash/client dimensions to aggregate across
- Smaller on-disk footprint for this view

**Costs:**

- Additional write amplification — every INSERT into `operations` triggers one more MV
  materialization
- Rules scoped to a saved filter (operation/client filters) can't use this view — they still need
  `operations_minutely` to filter by `hash` or `client_name`/`client_version`
- One more table to maintain and migrate

### Recommendation

For V1, we could start with the existing `operations_minutely` table and measure actual query
performance. The primary key starts with `target`, so ClickHouse can efficiently skip irrelevant
data even without a dedicated view. If we observe query latency issues at scale, we add the
specialized MV as an optimization.
+ +Alternatively, if we expect many target-wide alerts (likely the majority), adding the MV upfront +avoids a future ClickHouse migration. + +**Should we add a target-level MV now, or start with existing tables and optimize later?** + +--- + +## Performance & Scaling + +### Evaluation architecture + +A **single cron job** runs every minute and evaluates **all** enabled metric alerts. It does not +create one job per alert. The process: + +1. One PostgreSQL query fetches all enabled alerts (joined with channels, targets, orgs) +2. Alerts are grouped by `(target_id, time_window_minutes, filter)` — alerts sharing the same group + are served by a **single ClickHouse query** +3. Each ClickHouse query returns both time windows and all metrics (latency percentiles, totals, ok + counts) in one result set, serving multiple alert types simultaneously + +### How it scales + +Query count scales with **unique groups**, not alert count. Three alerts on the same target (e.g., +latency + errors + traffic with the same window and no filter) cost exactly one ClickHouse query. + +| Scenario | Configured alerts | Unique groups | CH queries/tick | Est. time (5 concurrent) | +| -------- | ----------------- | ------------- | --------------- | ------------------------ | +| Small | 10 | ~5 | 5 | ~10ms | +| Medium | 100 | ~30 | 30 | ~60ms | +| Large | 1,000 | ~150 | 150 | ~300ms | + +Even at 1,000 alerts, evaluation completes in well under a second. The practical ceiling is the +ClickHouse connection pool (32 sockets) and the 60-second task timeout. + +### Safeguards + +- **Task timeout**: 60 seconds. Graphile-Worker deduplicates cron tasks, so an overrunning + evaluation won't spawn a second concurrent instance. +- **Query timeout**: 10 seconds per ClickHouse query (matching existing timeouts in + `operations-reader.ts`). +- **Bounded concurrency**: ClickHouse queries execute with `p-queue` concurrency of 5 to avoid + saturating the connection pool. 
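The grouping step (item 2 in the evaluation process above) can be sketched like this. The `EnabledRule` shape is an illustrative assumption; the real Postgres query also joins channels, targets, and organizations:

```typescript
interface EnabledRule {
  id: string;
  targetId: string;
  timeWindowMinutes: number;
  savedFilterId: string | null;
}

// Group rules by (target, window, filter): each group is served by a single
// ClickHouse query per evaluation tick, regardless of how many rules it holds.
function groupRulesForEvaluation(rules: EnabledRule[]): Map<string, EnabledRule[]> {
  const groups = new Map<string, EnabledRule[]>();
  for (const rule of rules) {
    const key = `${rule.targetId}:${rule.timeWindowMinutes}:${rule.savedFilterId ?? 'none'}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(rule);
    groups.set(key, bucket);
  }
  return groups;
}
```

Each group's single result set (both windows, all metrics) is then fanned back out to the group's rules, so three alerts on one target still cost one query.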
+ +--- + +## Verification + +### Phase 1 + +1. Run migration, verify `metric_alert_rules` table created +2. CRUD metric alerts via GraphQL playground +3. Trigger `evaluateMetricAlertRules` task manually, verify ClickHouse queries and state transitions +4. Create a webhook metric alert, simulate threshold breach, verify webhook receives payload +5. Add integration test in `integration-tests/tests/api/project/alerts.spec.ts` + +### Phase 2 + +1. Run migration, verify `alert_channel_type` enum has `EMAIL` and `email_addresses` column exists +2. Create an EMAIL channel via GraphQL playground, verify it persists +3. Create a schema-change alert using the EMAIL channel, publish a schema change, verify email is + sent +4. Verify frontend form shows email option and validates correctly diff --git a/.prettierignore b/.prettierignore index 9cd07451949..a5b6096633a 100644 --- a/.prettierignore +++ b/.prettierignore @@ -26,6 +26,7 @@ pnpm-lock.yaml .bob/ .changeset/ +.claude/ CHANGELOG.md # temp volumes diff --git a/integration-tests/testkit/flow.ts b/integration-tests/testkit/flow.ts index b34f1070e00..478bf0f81a5 100644 --- a/integration-tests/testkit/flow.ts +++ b/integration-tests/testkit/flow.ts @@ -2,6 +2,7 @@ import { graphql } from './gql'; import type { AddAlertChannelInput, AddAlertInput, + AddMetricAlertRuleInput, AnswerOrganizationTransferRequestInput, AssignMemberRoleInput, CreateMemberRoleInput, @@ -11,6 +12,7 @@ import type { CreateTargetInput, CreateTokenInput, DeleteMemberRoleInput, + DeleteMetricAlertRulesInput, DeleteTokensInput, Experimental__UpdateTargetSchemaCompositionInput, InviteToOrganizationByEmailInput, @@ -24,6 +26,7 @@ import type { TargetSelectorInput, UpdateBaseSchemaInput, UpdateMemberRoleInput, + UpdateMetricAlertRuleInput, UpdateOrganizationSlugInput, UpdateProjectSlugInput, UpdateSchemaCompositionInput, @@ -647,6 +650,92 @@ export function addAlert(input: AddAlertInput, authToken: string) { }); } +export function addMetricAlertRule(input: 
AddMetricAlertRuleInput, authToken: string) { + return execute({ + document: graphql(` + mutation IntegrationTests_AddMetricAlertRule($input: AddMetricAlertRuleInput!) { + addMetricAlertRule(input: $input) { + ok { + addedMetricAlertRule { + id + name + type + metric + thresholdType + thresholdValue + direction + severity + state + timeWindowMinutes + confirmationMinutes + enabled + channels { + id + name + type + } + } + } + error { + message + } + } + } + `), + variables: { input }, + authToken, + }); +} + +export function updateMetricAlertRule(input: UpdateMetricAlertRuleInput, authToken: string) { + return execute({ + document: graphql(` + mutation IntegrationTests_UpdateMetricAlertRule($input: UpdateMetricAlertRuleInput!) { + updateMetricAlertRule(input: $input) { + ok { + updatedMetricAlertRule { + id + name + type + metric + thresholdType + thresholdValue + direction + severity + state + enabled + } + } + error { + message + } + } + } + `), + variables: { input }, + authToken, + }); +} + +export function deleteMetricAlertRules(input: DeleteMetricAlertRulesInput, authToken: string) { + return execute({ + document: graphql(` + mutation IntegrationTests_DeleteMetricAlertRules($input: DeleteMetricAlertRulesInput!) 
{
+        deleteMetricAlertRules(input: $input) {
+          ok {
+            deletedMetricAlertRuleIds
+          }
+          error {
+            message
+          }
+        }
+      }
+    `),
+    variables: { input },
+    authToken,
+  });
+}
+
 export function readOrganizationInfo(
   selector: {
     organizationSlug: string;
diff --git a/integration-tests/testkit/seed.ts b/integration-tests/testkit/seed.ts
index 71821303a53..b98d16a2b1b 100644
--- a/integration-tests/testkit/seed.ts
+++ b/integration-tests/testkit/seed.ts
@@ -19,6 +19,7 @@ import { ensureEnv } from './env';
 import {
   addAlert,
   addAlertChannel,
+  addMetricAlertRule,
   assignMemberRole,
   checkSchema,
   compareToPreviousVersion,
@@ -30,6 +31,7 @@ import {
   createTarget,
   createToken,
   deleteMemberRole,
+  deleteMetricAlertRules,
   deleteSchema,
   deleteTokens,
   fetchLatestSchema,
@@ -51,6 +53,7 @@ import {
   readTokenInfo,
   updateBaseSchema,
   updateMemberRole,
+  updateMetricAlertRule,
   updateTargetValidationSettings,
 } from './flow';
 import * as GraphQLSchema from './gql/graphql';
@@ -814,6 +817,31 @@ export function initSeed() {
         );
         return result.addAlertChannel;
       },
+      async addMetricAlertRule(
+        input: { token?: string } & Parameters<typeof addMetricAlertRule>[0],
+      ) {
+        const result = await addMetricAlertRule(input, input.token || ownerToken).then(
+          r => r.expectNoGraphQLErrors(),
+        );
+        return result.addMetricAlertRule;
+      },
+      async updateMetricAlertRule(
+        input: { token?: string } & Parameters<typeof updateMetricAlertRule>[0],
+      ) {
+        const result = await updateMetricAlertRule(input, input.token || ownerToken).then(
+          r => r.expectNoGraphQLErrors(),
+        );
+        return result.updateMetricAlertRule;
+      },
+      async deleteMetricAlertRules(
+        input: { token?: string } & Parameters<typeof deleteMetricAlertRules>[0],
+      ) {
+        const result = await deleteMetricAlertRules(
+          input,
+          input.token || ownerToken,
+        ).then(r => r.expectNoGraphQLErrors());
+        return result.deleteMetricAlertRules;
+      },
       /**
        * Create an access token for a given target.
        * This token can be used for usage reporting and all actions that would be performed by the CLI.
diff --git a/integration-tests/tests/api/project/metric-alerts.spec.ts b/integration-tests/tests/api/project/metric-alerts.spec.ts new file mode 100644 index 00000000000..64afb433267 --- /dev/null +++ b/integration-tests/tests/api/project/metric-alerts.spec.ts @@ -0,0 +1,227 @@ +import 'reflect-metadata'; +import { + AlertChannelType, + MetricAlertRuleDirection, + MetricAlertRuleMetric, + MetricAlertRuleSeverity, + MetricAlertRuleThresholdType, + MetricAlertRuleType, + ProjectType, +} from 'testkit/gql/graphql'; +import { initSeed } from '../../../testkit/seed'; + +test.concurrent('can create, read, update, and delete a metric alert rule', async ({ expect }) => { + const { createOrg } = await initSeed().createOwner(); + const { createProject, organization } = await createOrg(); + const { + project, + target, + addAlertChannel, + addMetricAlertRule, + updateMetricAlertRule, + deleteMetricAlertRules, + } = await createProject(ProjectType.Single); + + const organizationSlug = organization.slug; + const projectSlug = project.slug; + const targetSlug = target.slug; + + // Create a webhook channel to attach to the rule + const channelResult = await addAlertChannel({ + name: 'test-webhook', + organizationSlug, + projectSlug, + type: AlertChannelType.Webhook, + webhook: { endpoint: 'http://localhost:9876/webhook' }, + }); + expect(channelResult.ok).toBeTruthy(); + const channelId = channelResult.ok!.addedAlertChannel.id; + + // Create a metric alert rule + const addResult = await addMetricAlertRule({ + organizationSlug, + projectSlug, + targetSlug, + name: 'P99 Latency Spike', + type: MetricAlertRuleType.Latency, + metric: MetricAlertRuleMetric.P99, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 200, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Critical, + channelIds: [channelId], + }); + expect(addResult.ok).toBeTruthy(); + expect(addResult.error).toBeNull(); + + const rule = 
addResult.ok!.addedMetricAlertRule; + expect(rule.name).toBe('P99 Latency Spike'); + expect(rule.type).toBe('LATENCY'); + expect(rule.metric).toBe('p99'); + expect(rule.thresholdType).toBe('FIXED_VALUE'); + expect(rule.thresholdValue).toBe(200); + expect(rule.direction).toBe('ABOVE'); + expect(rule.severity).toBe('CRITICAL'); + expect(rule.state).toBe('NORMAL'); + expect(rule.timeWindowMinutes).toBe(30); + expect(rule.confirmationMinutes).toBe(0); + expect(rule.enabled).toBe(true); + expect(rule.channels).toHaveLength(1); + expect(rule.channels[0].id).toBe(channelId); + + // Update the rule + const updateResult = await updateMetricAlertRule({ + organizationSlug, + projectSlug, + ruleId: rule.id, + name: 'Updated Latency Alert', + thresholdValue: 300, + severity: MetricAlertRuleSeverity.Warning, + }); + expect(updateResult.ok).toBeTruthy(); + expect(updateResult.error).toBeNull(); + + const updated = updateResult.ok!.updatedMetricAlertRule; + expect(updated.name).toBe('Updated Latency Alert'); + expect(updated.thresholdValue).toBe(300); + expect(updated.severity).toBe('WARNING'); + // Unchanged fields should persist + expect(updated.type).toBe('LATENCY'); + expect(updated.metric).toBe('p99'); + + // Delete the rule + const deleteResult = await deleteMetricAlertRules({ + organizationSlug, + projectSlug, + ruleIds: [rule.id], + }); + expect(deleteResult.ok).toBeTruthy(); + expect(deleteResult.ok!.deletedMetricAlertRuleIds).toContain(rule.id); +}); + +test.concurrent( + 'validates that LATENCY type requires metric and non-LATENCY rejects it', + async ({ expect }) => { + const { createOrg } = await initSeed().createOwner(); + const { createProject, organization } = await createOrg(); + const { project, target, addAlertChannel, addMetricAlertRule } = await createProject( + ProjectType.Single, + ); + + const organizationSlug = organization.slug; + const projectSlug = project.slug; + const targetSlug = target.slug; + + const channelResult = await addAlertChannel({ + name: 
'test-webhook', + organizationSlug, + projectSlug, + type: AlertChannelType.Webhook, + webhook: { endpoint: 'http://localhost:9876/webhook' }, + }); + const channelId = channelResult.ok!.addedAlertChannel.id; + + // LATENCY without metric should fail + const noMetricResult = await addMetricAlertRule({ + organizationSlug, + projectSlug, + targetSlug, + name: 'Bad Latency Alert', + type: MetricAlertRuleType.Latency, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 200, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [channelId], + // metric intentionally omitted + }); + expect(noMetricResult.error).toBeTruthy(); + expect(noMetricResult.error!.message).toContain('Metric is required'); + + // ERROR_RATE with metric should fail + const withMetricResult = await addMetricAlertRule({ + organizationSlug, + projectSlug, + targetSlug, + name: 'Bad Error Rate Alert', + type: MetricAlertRuleType.ErrorRate, + metric: MetricAlertRuleMetric.P99, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 50, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [channelId], + }); + expect(withMetricResult.error).toBeTruthy(); + expect(withMetricResult.error!.message).toContain('should only be set for LATENCY'); + }, +); + +test.concurrent('requires at least one channel', async ({ expect }) => { + const { createOrg } = await initSeed().createOwner(); + const { createProject, organization } = await createOrg(); + const { project, target, addMetricAlertRule } = await createProject(ProjectType.Single); + + const result = await addMetricAlertRule({ + organizationSlug: organization.slug, + projectSlug: project.slug, + targetSlug: target.slug, + name: 'No Channels Alert', + type: MetricAlertRuleType.Traffic, + timeWindowMinutes: 5, + thresholdType: 
MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 1000000, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Info, + channelIds: [], + }); + expect(result.error).toBeTruthy(); + expect(result.error!.message).toContain('At least one channel'); +}); + +test.concurrent('supports multiple channels on a single rule', async ({ expect }) => { + const { createOrg } = await initSeed().createOwner(); + const { createProject, organization } = await createOrg(); + const { project, target, addAlertChannel, addMetricAlertRule } = await createProject( + ProjectType.Single, + ); + + const organizationSlug = organization.slug; + const projectSlug = project.slug; + + // Create two channels + const channel1 = await addAlertChannel({ + name: 'webhook-1', + organizationSlug, + projectSlug, + type: AlertChannelType.Webhook, + webhook: { endpoint: 'http://localhost:9876/webhook1' }, + }); + const channel2 = await addAlertChannel({ + name: 'webhook-2', + organizationSlug, + projectSlug, + type: AlertChannelType.Webhook, + webhook: { endpoint: 'http://localhost:9876/webhook2' }, + }); + + const result = await addMetricAlertRule({ + organizationSlug, + projectSlug, + targetSlug: target.slug, + name: 'Multi-Channel Alert', + type: MetricAlertRuleType.Traffic, + timeWindowMinutes: 5, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 30, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [channel1.ok!.addedAlertChannel.id, channel2.ok!.addedAlertChannel.id], + }); + + expect(result.ok).toBeTruthy(); + expect(result.ok!.addedMetricAlertRule.channels).toHaveLength(2); +}); diff --git a/package.json b/package.json index f7593d6a92d..5298f258400 100644 --- a/package.json +++ b/package.json @@ -45,6 +45,7 @@ "release:version": "changeset version && pnpm build:libraries && pnpm --filter @graphql-hive/cli oclif:readme", "seed:app-deployments": "tsx 
scripts/seed-app-deployments.mts", "seed:insights": "tsx scripts/seed-insights.mts", + "seed:metric-alerts": "tsx scripts/seed-metric-alerts.mts", "seed:org": "tsx scripts/seed-organization.mts", "seed:schemas": "tsx scripts/seed-schemas.ts", "seed:usage": "tsx scripts/seed-usage.ts", diff --git a/packages/migrations/src/actions/2026.04.15T00-00-01.metric-alert-rules.ts b/packages/migrations/src/actions/2026.04.15T00-00-01.metric-alert-rules.ts new file mode 100644 index 00000000000..b212848939d --- /dev/null +++ b/packages/migrations/src/actions/2026.04.15T00-00-01.metric-alert-rules.ts @@ -0,0 +1,84 @@ +import { type MigrationExecutor } from '../pg-migrator'; + +export default { + name: '2026.04.15T00-00-01.metric-alert-rules.ts', + run: ({ psql }) => psql` +CREATE TYPE "metric_alert_type" AS ENUM ('LATENCY', 'ERROR_RATE', 'TRAFFIC'); +CREATE TYPE "metric_alert_metric" AS ENUM ('avg', 'p75', 'p90', 'p95', 'p99'); +CREATE TYPE "metric_alert_threshold_type" AS ENUM ('FIXED_VALUE', 'PERCENTAGE_CHANGE'); +CREATE TYPE "metric_alert_direction" AS ENUM ('ABOVE', 'BELOW'); +CREATE TYPE "metric_alert_severity" AS ENUM ('INFO', 'WARNING', 'CRITICAL'); +CREATE TYPE "metric_alert_state" AS ENUM ('NORMAL', 'PENDING', 'FIRING', 'RECOVERING'); + +CREATE TABLE "metric_alert_rules" ( + "id" uuid NOT NULL DEFAULT uuid_generate_v4(), + "organization_id" uuid NOT NULL REFERENCES "organizations"("id") ON DELETE CASCADE, + "project_id" uuid NOT NULL REFERENCES "projects"("id") ON DELETE CASCADE, + "target_id" uuid NOT NULL REFERENCES "targets"("id") ON DELETE CASCADE, + "created_by_user_id" uuid REFERENCES "users"("id") ON DELETE SET NULL, + "type" "metric_alert_type" NOT NULL, + "time_window_minutes" integer NOT NULL DEFAULT 30, + "metric" "metric_alert_metric", + "threshold_type" "metric_alert_threshold_type" NOT NULL, + "threshold_value" double precision NOT NULL, + "direction" "metric_alert_direction" NOT NULL DEFAULT 'ABOVE', + "severity" "metric_alert_severity" NOT NULL DEFAULT 
'WARNING', + "name" text NOT NULL, + "created_at" timestamptz NOT NULL DEFAULT now(), + "updated_at" timestamptz NOT NULL DEFAULT now(), + "enabled" boolean NOT NULL DEFAULT true, + "last_evaluated_at" timestamptz, + "last_triggered_at" timestamptz, + "state" "metric_alert_state" NOT NULL DEFAULT 'NORMAL', + "state_changed_at" timestamptz, + "confirmation_minutes" integer NOT NULL DEFAULT 0, + "saved_filter_id" uuid REFERENCES "saved_filters"("id") ON DELETE SET NULL, + PRIMARY KEY ("id"), + CONSTRAINT "metric_alert_rules_metric_required" CHECK ( + ("type" = 'LATENCY' AND "metric" IS NOT NULL) OR ("type" != 'LATENCY' AND "metric" IS NULL) + ) +); + +CREATE INDEX "idx_metric_alert_rules_enabled" ON "metric_alert_rules" ("enabled") WHERE "enabled" = true; + +CREATE TABLE "metric_alert_rule_channels" ( + "metric_alert_rule_id" uuid NOT NULL REFERENCES "metric_alert_rules"("id") ON DELETE CASCADE, + "alert_channel_id" uuid NOT NULL REFERENCES "alert_channels"("id") ON DELETE CASCADE, + PRIMARY KEY ("metric_alert_rule_id", "alert_channel_id") +); + +CREATE INDEX "idx_metric_alert_rule_channels_channel" ON "metric_alert_rule_channels" ("alert_channel_id"); + +CREATE TABLE "metric_alert_incidents" ( + "id" uuid NOT NULL DEFAULT uuid_generate_v4(), + "metric_alert_rule_id" uuid NOT NULL REFERENCES "metric_alert_rules"("id") ON DELETE CASCADE, + "started_at" timestamptz NOT NULL DEFAULT now(), + "resolved_at" timestamptz, + "current_value" double precision NOT NULL, + "previous_value" double precision, + "threshold_value" double precision NOT NULL, + PRIMARY KEY ("id") +); + +CREATE INDEX "idx_metric_alert_incidents_rule" ON "metric_alert_incidents" ("metric_alert_rule_id"); +CREATE INDEX "idx_metric_alert_incidents_open" ON "metric_alert_incidents" ("metric_alert_rule_id") WHERE "resolved_at" IS NULL; + +CREATE TABLE "metric_alert_state_log" ( + "id" uuid NOT NULL DEFAULT uuid_generate_v4(), + "metric_alert_rule_id" uuid NOT NULL REFERENCES "metric_alert_rules"("id") ON 
DELETE CASCADE, + "target_id" uuid NOT NULL REFERENCES "targets"("id") ON DELETE CASCADE, + "from_state" "metric_alert_state" NOT NULL, + "to_state" "metric_alert_state" NOT NULL, + "value" double precision, + "previous_value" double precision, + "threshold_value" double precision, + "created_at" timestamptz NOT NULL DEFAULT now(), + "expires_at" timestamptz NOT NULL, + PRIMARY KEY ("id") +); + +CREATE INDEX "idx_metric_alert_state_log_rule" ON "metric_alert_state_log" ("metric_alert_rule_id", "created_at"); +CREATE INDEX "idx_metric_alert_state_log_target" ON "metric_alert_state_log" ("target_id", "created_at"); +CREATE INDEX "idx_metric_alert_state_log_expires" ON "metric_alert_state_log" ("expires_at"); +`, +} satisfies MigrationExecutor; diff --git a/packages/migrations/src/run-pg-migrations.ts b/packages/migrations/src/run-pg-migrations.ts index 7e8d35806d5..ac87094827e 100644 --- a/packages/migrations/src/run-pg-migrations.ts +++ b/packages/migrations/src/run-pg-migrations.ts @@ -184,5 +184,6 @@ export const runPGMigrations = async (args: { slonik: PostgresDatabasePool; runT await import('./actions/2026.02.24T00-00-00.proposal-composition'), await import('./actions/2026.02.25T00-00-00.oidc-integration-domains'), await import('./actions/2026.03.25T00-00-00.access-token-expiration'), + await import('./actions/2026.04.15T00-00-01.metric-alert-rules'), ], }); diff --git a/packages/services/api/src/modules/alerts/index.ts b/packages/services/api/src/modules/alerts/index.ts index 471d4890ccc..7f1f613a491 100644 --- a/packages/services/api/src/modules/alerts/index.ts +++ b/packages/services/api/src/modules/alerts/index.ts @@ -3,6 +3,7 @@ import { TeamsCommunicationAdapter } from './providers/adapters/msteams'; import { SlackCommunicationAdapter } from './providers/adapters/slack'; import { WebhookCommunicationAdapter } from './providers/adapters/webhook'; import { AlertsManager } from './providers/alerts-manager'; +import { MetricAlertRulesStorage } from 
'./providers/metric-alert-rules-storage'; import { resolvers } from './resolvers.generated'; import typeDefs from './module.graphql'; @@ -13,6 +14,7 @@ export const alertsModule = createModule({ resolvers, providers: [ AlertsManager, + MetricAlertRulesStorage, SlackCommunicationAdapter, WebhookCommunicationAdapter, TeamsCommunicationAdapter, diff --git a/packages/services/api/src/modules/alerts/module.graphql.mappers.ts b/packages/services/api/src/modules/alerts/module.graphql.mappers.ts index c34397be64d..ff1d30feb73 100644 --- a/packages/services/api/src/modules/alerts/module.graphql.mappers.ts +++ b/packages/services/api/src/modules/alerts/module.graphql.mappers.ts @@ -1,7 +1,16 @@ -import type { Alert, AlertChannel } from '../../shared/entities'; +import type { + Alert, + AlertChannel, + MetricAlertIncident, + MetricAlertRule, + MetricAlertStateLogEntry, +} from '../../shared/entities'; export type AlertChannelMapper = AlertChannel; export type AlertSlackChannelMapper = AlertChannel; export type AlertWebhookChannelMapper = AlertChannel; export type TeamsWebhookChannelMapper = AlertChannel; export type AlertMapper = Alert; +export type MetricAlertRuleMapper = MetricAlertRule; +export type MetricAlertRuleIncidentMapper = MetricAlertIncident; +export type MetricAlertRuleStateChangeMapper = MetricAlertStateLogEntry; diff --git a/packages/services/api/src/modules/alerts/module.graphql.ts b/packages/services/api/src/modules/alerts/module.graphql.ts index 2338f6e2c4d..2d8da406321 100644 --- a/packages/services/api/src/modules/alerts/module.graphql.ts +++ b/packages/services/api/src/modules/alerts/module.graphql.ts @@ -154,4 +154,216 @@ export default gql` channel: AlertChannel! target: Target! 
} + + # --- Metric Alert Rules --- + + enum MetricAlertRuleType { + LATENCY + ERROR_RATE + TRAFFIC + } + + enum MetricAlertRuleMetric { + avg + p75 + p90 + p95 + p99 + } + + enum MetricAlertRuleThresholdType { + FIXED_VALUE + PERCENTAGE_CHANGE + } + + enum MetricAlertRuleDirection { + ABOVE + BELOW + } + + enum MetricAlertRuleSeverity { + INFO + WARNING + CRITICAL + } + + enum MetricAlertRuleState { + NORMAL + PENDING + FIRING + RECOVERING + } + + type MetricAlertRule { + id: ID! + name: String! + type: MetricAlertRuleType! + target: Target! + """ + Destinations that receive notifications when this rule fires or resolves. + """ + channels: [AlertChannel!]! + timeWindowMinutes: Int! + metric: MetricAlertRuleMetric + thresholdType: MetricAlertRuleThresholdType! + thresholdValue: Float! + direction: MetricAlertRuleDirection! + severity: MetricAlertRuleSeverity! + state: MetricAlertRuleState! + confirmationMinutes: Int! + enabled: Boolean! + lastEvaluatedAt: DateTime + """ + Most recent time this rule transitioned PENDING → FIRING (null if never fired). + """ + lastTriggeredAt: DateTime + createdAt: DateTime! + createdBy: User + """ + The saved filter that scopes this rule (null = applies to the whole target). + """ + savedFilter: SavedFilter + """ + Count of state transitions logged for this rule in the given time range. + """ + eventCount(from: DateTime!, to: DateTime!): Int! + """ + The currently open incident, if any. + """ + currentIncident: MetricAlertRuleIncident + """ + Past incidents for this alert rule. + """ + incidentHistory(limit: Int, offset: Int): [MetricAlertRuleIncident!]! + """ + State change history for this rule (powers the state timeline). + """ + stateLog(from: DateTime!, to: DateTime!): [MetricAlertRuleStateChange!]! + } + + type MetricAlertRuleIncident { + id: ID! + startedAt: DateTime! + resolvedAt: DateTime + currentValue: Float! + previousValue: Float + thresholdValue: Float! + } + + type MetricAlertRuleStateChange { + id: ID! 
+ fromState: MetricAlertRuleState! + toState: MetricAlertRuleState! + """ + Metric value in the current window at transition time. + """ + value: Float + """ + Metric value in the previous (comparison) window at transition time. + """ + previousValue: Float + """ + Threshold value snapshotted at transition time (survives rule edits). + """ + thresholdValue: Float + createdAt: DateTime! + rule: MetricAlertRule! + } + + extend type Target { + """ + State changes across all alert rules for this target (powers the alert events chart + list). + """ + metricAlertRuleStateLog(from: DateTime!, to: DateTime!): [MetricAlertRuleStateChange!]! + } + + extend type Target { + metricAlertRules: [MetricAlertRule!]! + } + + extend type Mutation { + addMetricAlertRule(input: AddMetricAlertRuleInput!): AddMetricAlertRuleResult! + updateMetricAlertRule(input: UpdateMetricAlertRuleInput!): UpdateMetricAlertRuleResult! + deleteMetricAlertRules(input: DeleteMetricAlertRulesInput!): DeleteMetricAlertRulesResult! + } + + input AddMetricAlertRuleInput { + organizationSlug: String! + projectSlug: String! + targetSlug: String! + name: String! + type: MetricAlertRuleType! + timeWindowMinutes: Int! + metric: MetricAlertRuleMetric + thresholdType: MetricAlertRuleThresholdType! + thresholdValue: Float! + direction: MetricAlertRuleDirection! + severity: MetricAlertRuleSeverity! + confirmationMinutes: Int + channelIds: [ID!]! + savedFilterId: ID + } + + input UpdateMetricAlertRuleInput { + organizationSlug: String! + projectSlug: String! + ruleId: ID! + name: String + type: MetricAlertRuleType + timeWindowMinutes: Int + metric: MetricAlertRuleMetric + thresholdType: MetricAlertRuleThresholdType + thresholdValue: Float + direction: MetricAlertRuleDirection + severity: MetricAlertRuleSeverity + confirmationMinutes: Int + channelIds: [ID!] + savedFilterId: ID + enabled: Boolean + } + + input DeleteMetricAlertRulesInput { + organizationSlug: String! + projectSlug: String! + ruleIds: [ID!]! 
+ } + + type AddMetricAlertRuleResult { + ok: AddMetricAlertRuleOk + error: AddMetricAlertRuleError + } + + type AddMetricAlertRuleOk { + addedMetricAlertRule: MetricAlertRule! + } + + type AddMetricAlertRuleError implements Error { + message: String! + } + + type UpdateMetricAlertRuleResult { + ok: UpdateMetricAlertRuleOk + error: UpdateMetricAlertRuleError + } + + type UpdateMetricAlertRuleOk { + updatedMetricAlertRule: MetricAlertRule! + } + + type UpdateMetricAlertRuleError implements Error { + message: String! + } + + type DeleteMetricAlertRulesResult { + ok: DeleteMetricAlertRulesOk + error: DeleteMetricAlertRulesError + } + + type DeleteMetricAlertRulesOk { + deletedMetricAlertRuleIds: [ID!]! + } + + type DeleteMetricAlertRulesError implements Error { + message: String! + } `; diff --git a/packages/services/api/src/modules/alerts/providers/metric-alert-rules-storage.ts b/packages/services/api/src/modules/alerts/providers/metric-alert-rules-storage.ts new file mode 100644 index 00000000000..2ba3ef4e010 --- /dev/null +++ b/packages/services/api/src/modules/alerts/providers/metric-alert-rules-storage.ts @@ -0,0 +1,504 @@ +import { Injectable, Scope } from 'graphql-modules'; +import * as zod from 'zod'; +import { PostgresDatabasePool, psql } from '@hive/postgres'; +import type { + MetricAlertIncident, + MetricAlertRule, + MetricAlertRuleState, + MetricAlertStateLogEntry, +} from '../../../shared/entities'; + +const MetricAlertRuleModel = zod + .object({ + id: zod.string(), + organizationId: zod.string(), + projectId: zod.string(), + targetId: zod.string(), + createdByUserId: zod.string().nullable(), + type: zod.enum(['LATENCY', 'ERROR_RATE', 'TRAFFIC']), + timeWindowMinutes: zod.number(), + metric: zod.enum(['avg', 'p75', 'p90', 'p95', 'p99']).nullable(), + thresholdType: zod.enum(['FIXED_VALUE', 'PERCENTAGE_CHANGE']), + thresholdValue: zod.number(), + direction: zod.enum(['ABOVE', 'BELOW']), + severity: zod.enum(['INFO', 'WARNING', 'CRITICAL']), + name: 
zod.string(), + createdAt: zod.string(), + updatedAt: zod.string(), + enabled: zod.boolean(), + lastEvaluatedAt: zod.string().nullable(), + lastTriggeredAt: zod.string().nullable(), + state: zod.enum(['NORMAL', 'PENDING', 'FIRING', 'RECOVERING']), + stateChangedAt: zod.string().nullable(), + confirmationMinutes: zod.number(), + savedFilterId: zod.string().nullable(), + }) + .refine(data => (data.type === 'LATENCY') === (data.metric !== null), { + message: 'metric must be set for LATENCY type and null for other types', + }); + +const MetricAlertIncidentModel = zod.object({ + id: zod.string(), + metricAlertRuleId: zod.string(), + startedAt: zod.string(), + resolvedAt: zod.string().nullable(), + currentValue: zod.number(), + previousValue: zod.number().nullable(), + thresholdValue: zod.number(), +}); + +const MetricAlertStateLogModel = zod.object({ + id: zod.string(), + metricAlertRuleId: zod.string(), + targetId: zod.string(), + fromState: zod.enum(['NORMAL', 'PENDING', 'FIRING', 'RECOVERING']), + toState: zod.enum(['NORMAL', 'PENDING', 'FIRING', 'RECOVERING']), + value: zod.number().nullable(), + previousValue: zod.number().nullable(), + thresholdValue: zod.number().nullable(), + createdAt: zod.string(), + expiresAt: zod.string(), +}); + +const METRIC_ALERT_RULE_SELECT = psql` + "id" + , "organization_id" as "organizationId" + , "project_id" as "projectId" + , "target_id" as "targetId" + , "created_by_user_id" as "createdByUserId" + , "type" + , "time_window_minutes" as "timeWindowMinutes" + , "metric" + , "threshold_type" as "thresholdType" + , "threshold_value" as "thresholdValue" + , "direction" + , "severity" + , "name" + , to_json("created_at") as "createdAt" + , to_json("updated_at") as "updatedAt" + , "enabled" + , to_json("last_evaluated_at") as "lastEvaluatedAt" + , to_json("last_triggered_at") as "lastTriggeredAt" + , "state" + , to_json("state_changed_at") as "stateChangedAt" + , "confirmation_minutes" as "confirmationMinutes" + , "saved_filter_id" as 
"savedFilterId" +`; + +const METRIC_ALERT_INCIDENT_SELECT = psql` + "id" + , "metric_alert_rule_id" as "metricAlertRuleId" + , to_json("started_at") as "startedAt" + , to_json("resolved_at") as "resolvedAt" + , "current_value" as "currentValue" + , "previous_value" as "previousValue" + , "threshold_value" as "thresholdValue" +`; + +const METRIC_ALERT_STATE_LOG_SELECT = psql` + "id" + , "metric_alert_rule_id" as "metricAlertRuleId" + , "target_id" as "targetId" + , "from_state" as "fromState" + , "to_state" as "toState" + , "value" + , "previous_value" as "previousValue" + , "threshold_value" as "thresholdValue" + , to_json("created_at") as "createdAt" + , to_json("expires_at") as "expiresAt" +`; + +@Injectable({ + scope: Scope.Operation, +}) +export class MetricAlertRulesStorage { + constructor(private pool: PostgresDatabasePool) {} + + // --- Alert Rule CRUD --- + + async getMetricAlertRule(args: { id: string }): Promise { + const result = await this.pool.maybeOne(psql`/* getMetricAlertRule */ + SELECT ${METRIC_ALERT_RULE_SELECT} + FROM "metric_alert_rules" + WHERE "id" = ${args.id} + `); + + if (result === null) { + return null; + } + + return MetricAlertRuleModel.parse(result) as MetricAlertRule; + } + + async getMetricAlertRules(args: { projectId: string }): Promise { + const result = await this.pool.any(psql`/* getMetricAlertRules */ + SELECT ${METRIC_ALERT_RULE_SELECT} + FROM "metric_alert_rules" + WHERE "project_id" = ${args.projectId} + ORDER BY "created_at" DESC + `); + + return result.map(row => MetricAlertRuleModel.parse(row) as MetricAlertRule); + } + + async getMetricAlertRulesByTarget(args: { targetId: string }): Promise { + const result = await this.pool.any(psql`/* getMetricAlertRulesByTarget */ + SELECT ${METRIC_ALERT_RULE_SELECT} + FROM "metric_alert_rules" + WHERE "target_id" = ${args.targetId} + ORDER BY "created_at" DESC + `); + + return result.map(row => MetricAlertRuleModel.parse(row) as MetricAlertRule); + } + + async 
getAllEnabledMetricAlertRules(): Promise<MetricAlertRule[]> { + const result = await this.pool.any(psql`/* getAllEnabledMetricAlertRules */ + SELECT ${METRIC_ALERT_RULE_SELECT} + FROM "metric_alert_rules" + WHERE "enabled" = true + ORDER BY "id" + `); + + return result.map(row => MetricAlertRuleModel.parse(row) as MetricAlertRule); + } + + async addMetricAlertRule(args: { + organizationId: string; + projectId: string; + targetId: string; + createdByUserId: string | null; + type: MetricAlertRule['type']; + timeWindowMinutes: number; + metric: MetricAlertRule['metric']; + thresholdType: MetricAlertRule['thresholdType']; + thresholdValue: number; + direction: MetricAlertRule['direction']; + severity: MetricAlertRule['severity']; + name: string; + confirmationMinutes: number; + savedFilterId: string | null; + }): Promise<MetricAlertRule> { + const result = await this.pool.one(psql`/* addMetricAlertRule */ + INSERT INTO "metric_alert_rules" ( + "organization_id" + , "project_id" + , "target_id" + , "created_by_user_id" + , "type" + , "time_window_minutes" + , "metric" + , "threshold_type" + , "threshold_value" + , "direction" + , "severity" + , "name" + , "confirmation_minutes" + , "saved_filter_id" + ) + VALUES ( + ${args.organizationId} + , ${args.projectId} + , ${args.targetId} + , ${args.createdByUserId} + , ${args.type} + , ${args.timeWindowMinutes} + , ${args.metric} + , ${args.thresholdType} + , ${args.thresholdValue} + , ${args.direction} + , ${args.severity} + , ${args.name} + , ${args.confirmationMinutes} + , ${args.savedFilterId} + ) + RETURNING ${METRIC_ALERT_RULE_SELECT} + `); + + return MetricAlertRuleModel.parse(result) as MetricAlertRule; + } + + async updateMetricAlertRule(args: { + id: string; + type?: MetricAlertRule['type']; + timeWindowMinutes?: number; + metric?: MetricAlertRule['metric']; + thresholdType?: MetricAlertRule['thresholdType']; + thresholdValue?: number; + direction?: MetricAlertRule['direction']; + severity?: MetricAlertRule['severity']; + name?: string; + 
confirmationMinutes?: number; + savedFilterId?: string | null; + enabled?: boolean; + }): Promise<MetricAlertRule | null> { + const result = await this.pool.maybeOne(psql`/* updateMetricAlertRule */ + UPDATE "metric_alert_rules" + SET + "type" = COALESCE(${args.type ?? null}, "type") + , "time_window_minutes" = COALESCE(${args.timeWindowMinutes ?? null}, "time_window_minutes") + , "metric" = /* cleared when the effective type is not LATENCY, otherwise a type change would violate the "metric_alert_rules_metric_required" CHECK */ CASE WHEN COALESCE(${args.type ?? null}, "type") = 'LATENCY' THEN COALESCE(${args.metric ?? null}, "metric") ELSE NULL END + , "threshold_type" = COALESCE(${args.thresholdType ?? null}, "threshold_type") + , "threshold_value" = COALESCE(${args.thresholdValue ?? null}, "threshold_value") + , "direction" = COALESCE(${args.direction ?? null}, "direction") + , "severity" = COALESCE(${args.severity ?? null}, "severity") + , "name" = COALESCE(${args.name ?? null}, "name") + , "confirmation_minutes" = COALESCE(${args.confirmationMinutes ?? null}, "confirmation_minutes") + , "saved_filter_id" = COALESCE(${args.savedFilterId ?? null}, "saved_filter_id") + , "enabled" = COALESCE(${args.enabled ?? null}, "enabled") + , "updated_at" = NOW() + WHERE + "id" = ${args.id} + RETURNING ${METRIC_ALERT_RULE_SELECT} + `); + + if (result === null) { + return null; + } + + return MetricAlertRuleModel.parse(result) as MetricAlertRule; + } + + async deleteMetricAlertRules(args: { + projectId: string; + ruleIds: string[]; + }): Promise<MetricAlertRule[]> { + const result = await this.pool.any(psql`/* deleteMetricAlertRules */ + DELETE FROM "metric_alert_rules" + WHERE + "project_id" = ${args.projectId} + AND "id" = ANY(${psql.array(args.ruleIds, 'uuid')}) + RETURNING ${METRIC_ALERT_RULE_SELECT} + `); + + return result.map(row => MetricAlertRuleModel.parse(row) as MetricAlertRule); + } + + // --- Rule Channels (many-to-many) --- + + async setRuleChannels(args: { ruleId: string; channelIds: string[] }): Promise<void> { + await this.pool.transaction('setRuleChannels', async trx => { + await trx.query(psql` + DELETE FROM "metric_alert_rule_channels" + WHERE "metric_alert_rule_id" = ${args.ruleId} + `); + + if (args.channelIds.length 
> 0) { + await trx.query(psql` + INSERT INTO "metric_alert_rule_channels" ("metric_alert_rule_id", "alert_channel_id") + SELECT ${args.ruleId}, unnest(${psql.array(args.channelIds, 'uuid')}) + `); + } + }); + } + + async getRuleChannelIds(args: { ruleId: string }): Promise<string[]> { + const result = await this.pool.anyFirst(psql`/* getRuleChannelIds */ + SELECT "alert_channel_id" + FROM "metric_alert_rule_channels" + WHERE "metric_alert_rule_id" = ${args.ruleId} + `); + + return result.map(id => zod.string().parse(id)); + } + + // --- Alert State Updates (used by evaluation engine) --- + + async updateRuleState(args: { + id: string; + state: MetricAlertRuleState; + stateChangedAt?: Date; + lastEvaluatedAt?: Date; + lastTriggeredAt?: Date; + }): Promise<void> { + await this.pool.query(psql`/* updateRuleState */ + UPDATE "metric_alert_rules" + SET + "state" = ${args.state} + , "state_changed_at" = COALESCE(${args.stateChangedAt?.toISOString() ?? null}, "state_changed_at") + , "last_evaluated_at" = COALESCE(${args.lastEvaluatedAt?.toISOString() ?? null}, "last_evaluated_at") + , "last_triggered_at" = COALESCE(${args.lastTriggeredAt?.toISOString() ?? 
null}, "last_triggered_at") + , "updated_at" = NOW() + WHERE + "id" = ${args.id} + `); + } + + // --- Incidents --- + + async createIncident(args: { + ruleId: string; + currentValue: number; + previousValue: number | null; + thresholdValue: number; + }): Promise { + const result = await this.pool.one(psql`/* createIncident */ + INSERT INTO "metric_alert_incidents" ( + "metric_alert_rule_id" + , "current_value" + , "previous_value" + , "threshold_value" + ) + VALUES ( + ${args.ruleId} + , ${args.currentValue} + , ${args.previousValue} + , ${args.thresholdValue} + ) + RETURNING ${METRIC_ALERT_INCIDENT_SELECT} + `); + + return MetricAlertIncidentModel.parse(result) as MetricAlertIncident; + } + + async resolveIncident(args: { ruleId: string }): Promise { + const result = await this.pool.maybeOne(psql`/* resolveIncident */ + UPDATE "metric_alert_incidents" + SET "resolved_at" = NOW() + WHERE + "metric_alert_rule_id" = ${args.ruleId} + AND "resolved_at" IS NULL + RETURNING ${METRIC_ALERT_INCIDENT_SELECT} + `); + + if (result === null) { + return null; + } + + return MetricAlertIncidentModel.parse(result) as MetricAlertIncident; + } + + async getOpenIncident(args: { ruleId: string }): Promise { + const result = await this.pool.maybeOne(psql`/* getOpenIncident */ + SELECT ${METRIC_ALERT_INCIDENT_SELECT} + FROM "metric_alert_incidents" + WHERE + "metric_alert_rule_id" = ${args.ruleId} + AND "resolved_at" IS NULL + `); + + if (result === null) { + return null; + } + + return MetricAlertIncidentModel.parse(result) as MetricAlertIncident; + } + + async getIncidentHistory(args: { + ruleId: string; + limit: number; + offset: number; + }): Promise { + const result = await this.pool.any(psql`/* getIncidentHistory */ + SELECT ${METRIC_ALERT_INCIDENT_SELECT} + FROM "metric_alert_incidents" + WHERE "metric_alert_rule_id" = ${args.ruleId} + ORDER BY "started_at" DESC + LIMIT ${args.limit} + OFFSET ${args.offset} + `); + + return result.map(row => MetricAlertIncidentModel.parse(row) 
as MetricAlertIncident); + } + + // --- State Log --- + + async logStateTransition(args: { + ruleId: string; + targetId: string; + fromState: MetricAlertRuleState; + toState: MetricAlertRuleState; + value: number | null; + previousValue: number | null; + thresholdValue: number | null; + expiresAt: Date; + }): Promise<MetricAlertStateLogEntry> { + const result = await this.pool.one(psql`/* logStateTransition */ + INSERT INTO "metric_alert_state_log" ( + "metric_alert_rule_id" + , "target_id" + , "from_state" + , "to_state" + , "value" + , "previous_value" + , "threshold_value" + , "expires_at" + ) + VALUES ( + ${args.ruleId} + , ${args.targetId} + , ${args.fromState} + , ${args.toState} + , ${args.value} + , ${args.previousValue} + , ${args.thresholdValue} + , ${args.expiresAt.toISOString()} + ) + RETURNING ${METRIC_ALERT_STATE_LOG_SELECT} + `); + + return MetricAlertStateLogModel.parse(result) as MetricAlertStateLogEntry; + } + + async getStateLog(args: { + ruleId: string; + from: Date; + to: Date; + }): Promise<MetricAlertStateLogEntry[]> { + const result = await this.pool.any(psql`/* getStateLog */ + SELECT ${METRIC_ALERT_STATE_LOG_SELECT} + FROM "metric_alert_state_log" + WHERE + "metric_alert_rule_id" = ${args.ruleId} + AND "created_at" >= ${args.from.toISOString()} + AND "created_at" <= ${args.to.toISOString()} + ORDER BY "created_at" DESC + `); + + return result.map(row => MetricAlertStateLogModel.parse(row) as MetricAlertStateLogEntry); + } + + async getStateLogByTarget(args: { + targetId: string; + from: Date; + to: Date; + }): Promise<MetricAlertStateLogEntry[]> { + const result = await this.pool.any(psql`/* getStateLogByTarget */ + SELECT ${METRIC_ALERT_STATE_LOG_SELECT} + FROM "metric_alert_state_log" + WHERE + "target_id" = ${args.targetId} + AND "created_at" >= ${args.from.toISOString()} + AND "created_at" <= ${args.to.toISOString()} + ORDER BY "created_at" DESC + `); + + return result.map(row => MetricAlertStateLogModel.parse(row) as MetricAlertStateLogEntry); + } + + async getEventCount(args: { ruleId: string; from: Date; to: Date 
}): Promise<number> { + const result = await this.pool.oneFirst(psql`/* getEventCount */ + SELECT count(*)::int + FROM "metric_alert_state_log" + WHERE + "metric_alert_rule_id" = ${args.ruleId} + AND "created_at" >= ${args.from.toISOString()} + AND "created_at" <= ${args.to.toISOString()} + `); + + return zod.number().parse(result); + } + + async purgeExpiredStateLog(): Promise<number> { + const result = await this.pool.oneFirst(psql`/* purgeExpiredStateLog */ + WITH deleted AS ( + DELETE FROM "metric_alert_state_log" + WHERE "expires_at" < NOW() + RETURNING 1 + ) + SELECT count(*)::int FROM deleted + `); + + return zod.number().parse(result); + } +} diff --git a/packages/services/api/src/modules/alerts/resolvers/MetricAlertRule.ts b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRule.ts new file mode 100644 index 00000000000..16386425687 --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRule.ts @@ -0,0 +1,62 @@ +import { SavedFiltersStorage } from '../../saved-filters/providers/saved-filters-storage'; +import { Storage } from '../../shared/providers/storage'; +import { TargetManager } from '../../target/providers/target-manager'; +import { AlertsManager } from '../providers/alerts-manager'; +import { MetricAlertRulesStorage } from '../providers/metric-alert-rules-storage'; +import type { MetricAlertRuleResolvers } from './../../../__generated__/types'; + +export const MetricAlertRule: MetricAlertRuleResolvers = { + createdBy: (rule, _, { injector }) => { + if (!rule.createdByUserId) { + return null; + } + return injector.get(Storage).getUserById({ id: rule.createdByUserId }); + }, + target: (rule, _, { injector }) => { + return injector.get(TargetManager).getTarget({ + targetId: rule.targetId, + projectId: rule.projectId, + organizationId: rule.organizationId, + }); + }, + channels: async (rule, _, { injector }) => { + const channelIds = await injector.get(MetricAlertRulesStorage).getRuleChannelIds({ + ruleId: rule.id, + }); + const 
allChannels = await injector.get(AlertsManager).getChannels({ + organizationId: rule.organizationId, + projectId: rule.projectId, + }); + return allChannels.filter(c => channelIds.includes(c.id)); + }, + savedFilter: (rule, _, { injector }) => { + if (!rule.savedFilterId) { + return null; + } + return injector.get(SavedFiltersStorage).getSavedFilter({ id: rule.savedFilterId }); + }, + eventCount: (rule, { from, to }, { injector }) => { + return injector.get(MetricAlertRulesStorage).getEventCount({ + ruleId: rule.id, + from: new Date(from), + to: new Date(to), + }); + }, + currentIncident: (rule, _, { injector }) => { + return injector.get(MetricAlertRulesStorage).getOpenIncident({ ruleId: rule.id }); + }, + incidentHistory: (rule, { limit, offset }, { injector }) => { + return injector.get(MetricAlertRulesStorage).getIncidentHistory({ + ruleId: rule.id, + limit: limit ?? 20, + offset: offset ?? 0, + }); + }, + stateLog: (rule, { from, to }, { injector }) => { + return injector.get(MetricAlertRulesStorage).getStateLog({ + ruleId: rule.id, + from: new Date(from), + to: new Date(to), + }); + }, +}; diff --git a/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleIncident.ts b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleIncident.ts new file mode 100644 index 00000000000..f7611a509f5 --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleIncident.ts @@ -0,0 +1,14 @@ +import type { MetricAlertRuleIncidentResolvers } from './../../../__generated__/types'; + +/* + * Note: This object type is generated because "MetricAlertRuleIncidentMapper" is declared. This is to ensure runtime safety. 
+ * + * When a mapper is used, it is possible to hit runtime errors in some scenarios: + * - given a field name, the schema type's field type does not match mapper's field type + * - or a schema type's field does not exist in the mapper's fields + * + * If you want to skip this file generation, remove the mapper or update the pattern in the `resolverGeneration.object` config. + */ +export const MetricAlertRuleIncident: MetricAlertRuleIncidentResolvers = { + /* Implement MetricAlertRuleIncident resolver logic here */ +}; diff --git a/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleStateChange.ts b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleStateChange.ts new file mode 100644 index 00000000000..a7e3c27544e --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/MetricAlertRuleStateChange.ts @@ -0,0 +1,16 @@ +import { MetricAlertRulesStorage } from '../providers/metric-alert-rules-storage'; +import type { MetricAlertRuleStateChangeResolvers } from './../../../__generated__/types'; + +export const MetricAlertRuleStateChange: MetricAlertRuleStateChangeResolvers = { + rule: async (entry, _, { injector }) => { + const rule = await injector + .get(MetricAlertRulesStorage) + .getMetricAlertRule({ id: entry.metricAlertRuleId }); + + if (!rule) { + throw new Error(`Metric alert rule ${entry.metricAlertRuleId} not found`); + } + + return rule; + }, +}; diff --git a/packages/services/api/src/modules/alerts/resolvers/Mutation/addMetricAlertRule.ts b/packages/services/api/src/modules/alerts/resolvers/Mutation/addMetricAlertRule.ts new file mode 100644 index 00000000000..2e2112b4466 --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/Mutation/addMetricAlertRule.ts @@ -0,0 +1,71 @@ +import { Session } from '../../../auth/lib/authz'; +import { IdTranslator } from '../../../shared/providers/id-translator'; +import { MetricAlertRulesStorage } from '../../providers/metric-alert-rules-storage'; +import type { 
MutationResolvers } from './../../../../__generated__/types'; + +export const addMetricAlertRule: NonNullable<MutationResolvers['addMetricAlertRule']> = async ( + _, + { input }, + { injector, session }, +) => { + const translator = injector.get(IdTranslator); + const [organizationId, projectId, targetId] = await Promise.all([ + translator.translateOrganizationId(input), + translator.translateProjectId(input), + translator.translateTargetId(input), + ]); + + await injector.get(Session).assertPerformAction({ + action: 'alert:modify', + organizationId, + params: { organizationId, projectId }, + }); + + const currentUser = await session.getViewer(); + + if (input.type === 'LATENCY' && !input.metric) { + return { + error: { message: 'Metric is required for LATENCY alert type.' }, + }; + } + + if (input.type !== 'LATENCY' && input.metric) { + return { + error: { message: 'Metric should only be set for LATENCY alert type.' }, + }; + } + + if (input.channelIds.length === 0) { + return { + error: { message: 'At least one channel is required.' }, + }; + } + + const storage = injector.get(MetricAlertRulesStorage); + + const rule = await storage.addMetricAlertRule({ + organizationId, + projectId, + targetId, + createdByUserId: currentUser.id, + type: input.type, + timeWindowMinutes: input.timeWindowMinutes, + metric: input.metric ?? null, + thresholdType: input.thresholdType, + thresholdValue: input.thresholdValue, + direction: input.direction, + severity: input.severity, + name: input.name, + confirmationMinutes: input.confirmationMinutes ?? 0, + savedFilterId: input.savedFilterId ?? 
null, + }); + + await storage.setRuleChannels({ + ruleId: rule.id, + channelIds: [...input.channelIds], + }); + + return { + ok: { addedMetricAlertRule: rule }, + }; +}; diff --git a/packages/services/api/src/modules/alerts/resolvers/Mutation/deleteMetricAlertRules.ts b/packages/services/api/src/modules/alerts/resolvers/Mutation/deleteMetricAlertRules.ts new file mode 100644 index 00000000000..64dd6ea80fe --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/Mutation/deleteMetricAlertRules.ts @@ -0,0 +1,31 @@ +import { Session } from '../../../auth/lib/authz'; +import { IdTranslator } from '../../../shared/providers/id-translator'; +import { MetricAlertRulesStorage } from '../../providers/metric-alert-rules-storage'; +import type { MutationResolvers } from './../../../../__generated__/types'; + +export const deleteMetricAlertRules: NonNullable< + MutationResolvers['deleteMetricAlertRules'] +> = async (_, { input }, { injector }) => { + const translator = injector.get(IdTranslator); + const [organizationId, projectId] = await Promise.all([ + translator.translateOrganizationId(input), + translator.translateProjectId(input), + ]); + + await injector.get(Session).assertPerformAction({ + action: 'alert:modify', + organizationId, + params: { organizationId, projectId }, + }); + + const deleted = await injector.get(MetricAlertRulesStorage).deleteMetricAlertRules({ + projectId, + ruleIds: [...input.ruleIds], + }); + + return { + ok: { + deletedMetricAlertRuleIds: deleted.map(r => r.id), + }, + }; +}; diff --git a/packages/services/api/src/modules/alerts/resolvers/Mutation/updateMetricAlertRule.ts b/packages/services/api/src/modules/alerts/resolvers/Mutation/updateMetricAlertRule.ts new file mode 100644 index 00000000000..b0847453cc2 --- /dev/null +++ b/packages/services/api/src/modules/alerts/resolvers/Mutation/updateMetricAlertRule.ts @@ -0,0 +1,77 @@ +import { Session } from '../../../auth/lib/authz'; +import { IdTranslator } from 
'../../../shared/providers/id-translator'; +import { MetricAlertRulesStorage } from '../../providers/metric-alert-rules-storage'; +import type { MutationResolvers } from './../../../../__generated__/types'; + +export const updateMetricAlertRule: NonNullable< + MutationResolvers['updateMetricAlertRule'] +> = async (_, { input }, { injector }) => { + const translator = injector.get(IdTranslator); + const [organizationId, projectId] = await Promise.all([ + translator.translateOrganizationId(input), + translator.translateProjectId(input), + ]); + + await injector.get(Session).assertPerformAction({ + action: 'alert:modify', + organizationId, + params: { organizationId, projectId }, + }); + + const storage = injector.get(MetricAlertRulesStorage); + + const existing = await storage.getMetricAlertRule({ id: input.ruleId }); + if (!existing || existing.projectId !== projectId) { + return { + error: { message: 'Metric alert rule not found.' }, + }; + } + + // Validate metric constraint against the effective type after update + const effectiveType = input.type ?? existing.type; + const effectiveMetric = input.metric !== undefined ? input.metric : existing.metric; + + if (effectiveType === 'LATENCY' && !effectiveMetric) { + return { + error: { message: 'Metric is required for LATENCY alert type.' }, + }; + } + + if (effectiveType !== 'LATENCY' && effectiveMetric) { + return { + error: { message: 'Metric should only be set for LATENCY alert type.' }, + }; + } + + const rule = await storage.updateMetricAlertRule({ + id: input.ruleId, + type: input.type ?? undefined, + timeWindowMinutes: input.timeWindowMinutes ?? undefined, + metric: input.metric ?? undefined, + thresholdType: input.thresholdType ?? undefined, + thresholdValue: input.thresholdValue ?? undefined, + direction: input.direction ?? undefined, + severity: input.severity ?? undefined, + name: input.name ?? undefined, + confirmationMinutes: input.confirmationMinutes ?? undefined, + savedFilterId: input.savedFilterId ?? 
undefined,
+    enabled: input.enabled ?? undefined,
+  });
+
+  if (!rule) {
+    return {
+      error: { message: 'Failed to update metric alert rule.' },
+    };
+  }
+
+  if (input.channelIds) {
+    await storage.setRuleChannels({
+      ruleId: rule.id,
+      channelIds: [...input.channelIds],
+    });
+  }
+
+  return {
+    ok: { updatedMetricAlertRule: rule },
+  };
+};
diff --git a/packages/services/api/src/modules/alerts/resolvers/Target.ts b/packages/services/api/src/modules/alerts/resolvers/Target.ts
new file mode 100644
index 00000000000..22f5633cc3b
--- /dev/null
+++ b/packages/services/api/src/modules/alerts/resolvers/Target.ts
@@ -0,0 +1,17 @@
+import { MetricAlertRulesStorage } from '../providers/metric-alert-rules-storage';
+import type { TargetResolvers } from './../../../__generated__/types';
+
+export const Target: Pick<TargetResolvers, 'metricAlertRules' | 'metricAlertRuleStateLog'> = {
+  metricAlertRules: (target, _, { injector }) => {
+    return injector.get(MetricAlertRulesStorage).getMetricAlertRulesByTarget({
+      targetId: target.id,
+    });
+  },
+  metricAlertRuleStateLog: (target, { from, to }, { injector }) => {
+    return injector.get(MetricAlertRulesStorage).getStateLogByTarget({
+      targetId: target.id,
+      from: new Date(from),
+      to: new Date(to),
+    });
+  },
+};
diff --git a/packages/services/api/src/shared/entities.ts b/packages/services/api/src/shared/entities.ts
index 3b1de843912..cdda4972374 100644
--- a/packages/services/api/src/shared/entities.ts
+++ b/packages/services/api/src/shared/entities.ts
@@ -436,6 +436,61 @@ export interface Alert {
   createdAt: string;
 }
 
+export type MetricAlertRuleType = 'LATENCY' | 'ERROR_RATE' | 'TRAFFIC';
+export type MetricAlertRuleMetric = 'avg' | 'p75' | 'p90' | 'p95' | 'p99';
+export type MetricAlertRuleThresholdType = 'FIXED_VALUE' | 'PERCENTAGE_CHANGE';
+export type MetricAlertRuleDirection = 'ABOVE' | 'BELOW';
+export type MetricAlertRuleSeverity = 'INFO' | 'WARNING' | 'CRITICAL';
+export type MetricAlertRuleState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';
+
+export interface MetricAlertRule {
id: string;
+  organizationId: string;
+  projectId: string;
+  targetId: string;
+  createdByUserId: string | null;
+  type: MetricAlertRuleType;
+  timeWindowMinutes: number;
+  metric: MetricAlertRuleMetric | null;
+  thresholdType: MetricAlertRuleThresholdType;
+  thresholdValue: number;
+  direction: MetricAlertRuleDirection;
+  severity: MetricAlertRuleSeverity;
+  name: string;
+  createdAt: string;
+  updatedAt: string;
+  enabled: boolean;
+  lastEvaluatedAt: string | null;
+  lastTriggeredAt: string | null;
+  state: MetricAlertRuleState;
+  stateChangedAt: string | null;
+  confirmationMinutes: number;
+  savedFilterId: string | null;
+}
+
+export interface MetricAlertIncident {
+  id: string;
+  metricAlertRuleId: string;
+  startedAt: string;
+  resolvedAt: string | null;
+  currentValue: number;
+  previousValue: number | null;
+  thresholdValue: number;
+}
+
+export interface MetricAlertStateLogEntry {
+  id: string;
+  metricAlertRuleId: string;
+  targetId: string;
+  fromState: MetricAlertRuleState;
+  toState: MetricAlertRuleState;
+  value: number | null;
+  previousValue: number | null;
+  thresholdValue: number | null;
+  createdAt: string;
+  expiresAt: string;
+}
+
 export interface AdminOrganizationStats {
   organization: Organization;
   versions: number;
diff --git a/packages/services/storage/src/db/types.ts b/packages/services/storage/src/db/types.ts
index 224b9403d7f..ae52d5ead18 100644
--- a/packages/services/storage/src/db/types.ts
+++ b/packages/services/storage/src/db/types.ts
@@ -159,15 +159,6 @@ export interface migration {
   name: string;
 }
 
-export interface oidc_integration_domains {
-  created_at: Date;
-  domain_name: string;
-  id: string;
-  oidc_integration_id: string;
-  organization_id: string;
-  verified_at: Date | null;
-}
-
 export interface oidc_integrations {
   additional_scopes: Array<string> | null;
   authorization_endpoint: string | null;
@@ -191,7 +182,6 @@ export interface organization_access_tokens {
   assigned_resources: any | null;
   created_at: Date;
   description: string;
-
expires_at: Date | null;
   first_characters: string;
   hash: string;
   id: string;
@@ -393,9 +383,6 @@ export interface schema_proposal_reviews {
 export interface schema_proposals {
   author: string;
   comments_count: number;
-  composition_status: string | null;
-  composition_status_reason: string | null;
-  composition_timestamp: Date | null;
   created_at: Date;
   description: string;
   id: string;
@@ -540,7 +527,6 @@ export interface DBTables {
   email_verifications: email_verifications;
   graphile_worker_deduplication: graphile_worker_deduplication;
   migration: migration;
-  oidc_integration_domains: oidc_integration_domains;
   oidc_integrations: oidc_integrations;
   organization_access_tokens: organization_access_tokens;
   organization_invitations: organization_invitations;
diff --git a/packages/services/workflows/package.json b/packages/services/workflows/package.json
index ee208dad925..30f5cba4113 100644
--- a/packages/services/workflows/package.json
+++ b/packages/services/workflows/package.json
@@ -18,6 +18,7 @@
     "@hive/service-common": "workspace:*",
     "@hive/storage": "workspace:*",
     "@sentry/node": "7.120.2",
+    "@slack/web-api": "7.10.0",
     "@trpc/client": "10.45.3",
     "@types/mjml": "4.7.1",
     "@types/nodemailer": "7.0.4",
diff --git a/packages/services/workflows/src/context.ts b/packages/services/workflows/src/context.ts
index cfc676545e2..8a5490f5ef0 100644
--- a/packages/services/workflows/src/context.ts
+++ b/packages/services/workflows/src/context.ts
@@ -1,6 +1,7 @@
 import type { Logger } from '@graphql-hive/logger';
 import { PostgresDatabasePool } from '@hive/postgres';
 import type { HivePubSub } from '@hive/pubsub';
+import type { ClickHouseClient } from './lib/clickhouse-client.js';
 import type { EmailProvider } from './lib/emails/providers.js';
 import type { SchemaProvider } from './lib/schema/provider.js';
 import type { RequestBroker } from './lib/webhooks/send-webhook.js';
@@ -10,6 +11,7 @@ export type Context = {
   email: EmailProvider;
   schema: SchemaProvider;
   pg:
PostgresDatabasePool;
+  clickhouse: ClickHouseClient | null;
   requestBroker: RequestBroker | null;
   pubSub: HivePubSub;
 };
diff --git a/packages/services/workflows/src/environment.ts b/packages/services/workflows/src/environment.ts
index 32ee8ee8be9..c0ff547a11a 100644
--- a/packages/services/workflows/src/environment.ts
+++ b/packages/services/workflows/src/environment.ts
@@ -90,6 +90,20 @@ const RedisModel = zod.object({
   REDIS_TLS_ENABLED: emptyString(zod.union([zod.literal('1'), zod.literal('0')]).optional()),
 });
 
+const ClickHouseModel = zod.union([
+  zod.object({
+    CLICKHOUSE: emptyString(zod.literal('0').optional()),
+  }),
+  zod.object({
+    CLICKHOUSE: zod.literal('1'),
+    CLICKHOUSE_HOST: zod.string(),
+    CLICKHOUSE_PORT: NumberFromString,
+    CLICKHOUSE_USERNAME: zod.string(),
+    CLICKHOUSE_PASSWORD: emptyString(zod.string().optional()),
+    CLICKHOUSE_PROTOCOL: emptyString(zod.string().optional()),
+  }),
+]);
+
 const RequestBrokerModel = zod.union([
   zod.object({
     REQUEST_BROKER: emptyString(zod.literal('0').optional()),
@@ -134,6 +148,7 @@
   prometheus: PrometheusModel.safeParse(process.env),
   log: LogModel.safeParse(process.env),
   tracing: OpenTelemetryConfigurationModel.safeParse(process.env),
+  clickhouse: ClickHouseModel.safeParse(process.env),
   requestBroker: RequestBrokerModel.safeParse(process.env),
   redis: RedisModel.safeParse(process.env),
 };
@@ -166,6 +181,7 @@
 const sentry = extractConfig(configs.sentry);
 const prometheus = extractConfig(configs.prometheus);
 const log = extractConfig(configs.log);
 const tracing = extractConfig(configs.tracing);
+const clickhouse = extractConfig(configs.clickhouse);
 const requestBroker = extractConfig(configs.requestBroker);
 const redis = extractConfig(configs.redis);
@@ -240,6 +256,16 @@ export const env = {
       user: postgres.POSTGRES_USER,
     } satisfies PostgresConnectionParamaters,
   },
+  clickhouse:
+    clickhouse.CLICKHOUSE === '1'
+      ?
{
+          host: clickhouse.CLICKHOUSE_HOST,
+          port: clickhouse.CLICKHOUSE_PORT,
+          username: clickhouse.CLICKHOUSE_USERNAME,
+          password: clickhouse.CLICKHOUSE_PASSWORD ?? '',
+          protocol: clickhouse.CLICKHOUSE_PROTOCOL,
+        }
+      : null,
   requestBroker:
     requestBroker.REQUEST_BROKER === '1'
       ? ({
diff --git a/packages/services/workflows/src/index.ts b/packages/services/workflows/src/index.ts
index d14be17d574..c57db291ded 100644
--- a/packages/services/workflows/src/index.ts
+++ b/packages/services/workflows/src/index.ts
@@ -12,6 +12,7 @@ import {
 } from '@hive/service-common';
 import { Context } from './context.js';
 import { env } from './environment.js';
+import { ClickHouseClient } from './lib/clickhouse-client.js';
 import { createEmailProvider } from './lib/emails/providers.js';
 import { schemaProvider } from './lib/schema/provider.js';
 import { bridgeFastifyLogger } from './logger.js';
@@ -43,6 +44,8 @@
   import('./tasks/usage-rate-limit-exceeded.js'),
   import('./tasks/usage-rate-limit-warning.js'),
   import('./tasks/schema-proposal-composition.js'),
+  import('./tasks/evaluate-metric-alert-rules.js'),
+  import('./tasks/purge-expired-alert-state-log.js'),
 ]);
 
 const crontab = `
@@ -50,6 +53,10 @@
   0 10 * * 0 purgeExpiredSchemaChecks
   # Every day at 3:00 AM
   0 3 * * * purgeExpiredDedupeKeys
+  # Evaluate metric alert rules every minute
+  * * * * * evaluateMetricAlertRules
+  # Purge expired alert state log entries daily at 4:00 AM
+  0 4 * * * purgeExpiredAlertStateLog
 `;
 
 const pg = await createPostgresDatabasePool({
@@ -86,10 +93,15 @@ const pubSub = createHivePubSub({
   ),
 });
 
+const clickhouse = env.clickhouse
+  ?
new ClickHouseClient(env.clickhouse, logger.child({ source: 'ClickHouse' }))
+  : null;
+
 const context: Context = {
   logger,
   email: createEmailProvider(env.email.provider, env.email.emailFrom),
   pg,
+  clickhouse,
   requestBroker: env.requestBroker,
   schema: schemaProvider({
     logger,
diff --git a/packages/services/workflows/src/lib/clickhouse-client.ts b/packages/services/workflows/src/lib/clickhouse-client.ts
new file mode 100644
index 00000000000..a187f022465
--- /dev/null
+++ b/packages/services/workflows/src/lib/clickhouse-client.ts
@@ -0,0 +1,50 @@
+import got from 'got';
+import type { Logger } from '@graphql-hive/logger';
+
+export type ClickHouseConfig = {
+  host: string;
+  port: number;
+  username: string;
+  password: string;
+  protocol?: string;
+};
+
+export class ClickHouseClient {
+  private baseUrl: string;
+
+  constructor(
+    private config: ClickHouseConfig,
+    private logger: Logger,
+  ) {
+    const protocol = config.protocol ?? 'http';
+    this.baseUrl = `${protocol}://${config.host}:${config.port}`;
+  }
+
+  async query<T = Record<string, unknown>>(sql: string): Promise<T[]> {
+    this.logger.debug('Executing ClickHouse query');
+
+    const response = await got.post(this.baseUrl, {
+      searchParams: {
+        database: 'default',
+        default_format: 'JSON',
+      },
+      headers: {
+        'Content-Type': 'text/plain',
+      },
+      username: this.config.username,
+      password: this.config.password,
+      body: sql,
+      timeout: {
+        request: 10_000,
+      },
+      retry: {
+        limit: 3,
+        methods: ['POST'],
+        statusCodes: [502, 503, 504],
+      },
+    });
+
+    const result = JSON.parse(response.body) as { data: T[] };
+    return result.data;
+  }
+}
diff --git a/packages/services/workflows/src/lib/metric-alert-evaluator.ts b/packages/services/workflows/src/lib/metric-alert-evaluator.ts
new file mode 100644
index 00000000000..0abec6893c4
--- /dev/null
+++ b/packages/services/workflows/src/lib/metric-alert-evaluator.ts
@@ -0,0 +1,376 @@
+import type { Logger } from '@graphql-hive/logger';
+import type { PostgresDatabasePool } from '@hive/postgres';
+import
{ psql } from '@hive/postgres';
+import type { ClickHouseClient } from './clickhouse-client.js';
+
+export type MetricAlertRuleRow = {
+  id: string;
+  organizationId: string;
+  projectId: string;
+  targetId: string;
+  name: string;
+  type: 'LATENCY' | 'ERROR_RATE' | 'TRAFFIC';
+  timeWindowMinutes: number;
+  metric: 'avg' | 'p75' | 'p90' | 'p95' | 'p99' | null;
+  thresholdType: 'FIXED_VALUE' | 'PERCENTAGE_CHANGE';
+  thresholdValue: number;
+  direction: 'ABOVE' | 'BELOW';
+  severity: 'INFO' | 'WARNING' | 'CRITICAL';
+  state: 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';
+  stateChangedAt: string | null;
+  confirmationMinutes: number;
+  savedFilterId: string | null;
+};
+
+type ClickHouseWindowRow = {
+  window: 'current' | 'previous';
+  total: string;
+  total_ok: string;
+  average: number;
+  percentiles: [number, number, number, number];
+};
+
+type GroupKey = string;
+
+const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
+
+function assertUUID(value: string): string {
+  if (!UUID_RE.test(value)) {
+    throw new Error(`Invalid UUID: ${value}`);
+  }
+  return value;
+}
+
+function makeGroupKey(rule: MetricAlertRuleRow): GroupKey {
+  return `${rule.targetId}:${rule.timeWindowMinutes}:${rule.savedFilterId ?? ''}`;
+}
+
+function extractMetricValue(row: ClickHouseWindowRow, rule: MetricAlertRuleRow): number {
+  const total = Number(row.total);
+  const totalOk = Number(row.total_ok);
+
+  switch (rule.type) {
+    case 'TRAFFIC':
+      return total;
+    case 'ERROR_RATE':
+      return total > 0 ? ((total - totalOk) / total) * 100 : 0;
+    case 'LATENCY': {
+      const metricMap: Record<string, number> = {
+        avg: row.average,
+        p75: row.percentiles[0],
+        p90: row.percentiles[1],
+        p95: row.percentiles[2],
+        p99: row.percentiles[3],
+      };
+      return metricMap[rule.metric!] ??
0;
+    }
+  }
+}
+
+function isThresholdBreached(
+  currentValue: number,
+  previousValue: number,
+  rule: MetricAlertRuleRow,
+): boolean {
+  let compareValue: number;
+
+  if (rule.thresholdType === 'FIXED_VALUE') {
+    compareValue = currentValue;
+  } else {
+    if (previousValue === 0) {
+      if (currentValue === 0) return false;
+      compareValue = currentValue;
+    } else {
+      compareValue = ((currentValue - previousValue) / previousValue) * 100;
+    }
+  }
+
+  return rule.direction === 'ABOVE'
+    ? compareValue > rule.thresholdValue
+    : compareValue < rule.thresholdValue;
+}
+
+const ALERT_STATE_LOG_RETENTION_DAYS: Record<string, number> = {
+  HOBBY: 7,
+  PRO: 7,
+  ENTERPRISE: 30,
+};
+
+async function getAlertStateLogRetentionDays(
+  pg: PostgresDatabasePool,
+  organizationId: string,
+): Promise<number> {
+  const result = await pg.maybeOneFirst(psql`
+    SELECT "plan_name" FROM "organizations" WHERE "id" = ${organizationId}
+  `);
+  const planName = typeof result === 'string' ? result : 'HOBBY';
+  return ALERT_STATE_LOG_RETENTION_DAYS[planName] ??
7;
+}
+
+function hasElapsed(stateChangedAt: string | null, minutes: number): boolean {
+  if (!stateChangedAt) return true;
+  const changedAt = new Date(stateChangedAt).getTime();
+  return Date.now() - changedAt >= minutes * 60_000;
+}
+
+function formatClickHouseDate(date: Date): string {
+  return date.toISOString().replace('T', ' ').replace('Z', '').slice(0, 19);
+}
+
+export async function fetchEnabledRules(pg: PostgresDatabasePool): Promise<MetricAlertRuleRow[]> {
+  const result = await pg.any(psql`
+    SELECT
+      "id"
+      , "organization_id" as "organizationId"
+      , "project_id" as "projectId"
+      , "target_id" as "targetId"
+      , "name"
+      , "type"
+      , "time_window_minutes" as "timeWindowMinutes"
+      , "metric"
+      , "threshold_type" as "thresholdType"
+      , "threshold_value" as "thresholdValue"
+      , "direction"
+      , "severity"
+      , "state"
+      , to_json("state_changed_at") as "stateChangedAt"
+      , "confirmation_minutes" as "confirmationMinutes"
+      , "saved_filter_id" as "savedFilterId"
+    FROM "metric_alert_rules"
+    WHERE "enabled" = true
+  `);
+
+  return result as unknown as MetricAlertRuleRow[];
+}
+
+export function groupRulesByQuery(
+  rules: MetricAlertRuleRow[],
+): Map<GroupKey, MetricAlertRuleRow[]> {
+  const groups = new Map<GroupKey, MetricAlertRuleRow[]>();
+  for (const rule of rules) {
+    const key = makeGroupKey(rule);
+    const group = groups.get(key);
+    if (group) {
+      group.push(rule);
+    } else {
+      groups.set(key, [rule]);
+    }
+  }
+  return groups;
+}
+
+export async function queryClickHouseWindows(
+  clickhouse: ClickHouseClient,
+  targetId: string,
+  timeWindowMinutes: number,
+): Promise<{ current: ClickHouseWindowRow | null; previous: ClickHouseWindowRow | null }> {
+  const now = Date.now();
+  const offsetMs = 60_000;
+  const windowMs = timeWindowMinutes * 60_000;
+
+  const currentWindowEnd = new Date(now - offsetMs);
+  const currentWindowStart = new Date(now - offsetMs - windowMs);
+  const previousWindowStart = new Date(now - offsetMs - 2 * windowMs);
+
+  const tableName = timeWindowMinutes <= 360 ?
'operations_minutely' : 'operations_hourly';
+
+  const safeTargetId = assertUUID(targetId);
+
+  // ClickHouse parameterized query syntax is not supported via the HTTP interface
+  // in the same way as PostgreSQL. We validate the UUID format above to prevent injection.
+  const sql = `
+    SELECT
+      CASE
+        WHEN timestamp >= '${formatClickHouseDate(currentWindowStart)}' THEN 'current'
+        ELSE 'previous'
+      END as window,
+      sum(total) as total,
+      sum(total_ok) as total_ok,
+      avgMerge(duration_avg) as average,
+      quantilesMerge(0.75, 0.90, 0.95, 0.99)(duration_quantiles) as percentiles
+    FROM ${tableName}
+    WHERE target = '${safeTargetId}'
+      AND timestamp >= '${formatClickHouseDate(previousWindowStart)}'
+      AND timestamp < '${formatClickHouseDate(currentWindowEnd)}'
+    GROUP BY window
+    ORDER BY window
+  `;
+
+  const rows = await clickhouse.query<ClickHouseWindowRow>(sql);
+
+  return {
+    current: rows.find(r => r.window === 'current') ?? null,
+    previous: rows.find(r => r.window === 'previous') ?? null,
+  };
+}
+
+export type OnStateTransition = (args: {
+  rule: MetricAlertRuleRow;
+  fromState: MetricAlertRuleRow['state'];
+  toState: MetricAlertRuleRow['state'];
+  currentValue: number;
+  previousValue: number;
+}) => Promise<void>;
+
+export async function evaluateRule(args: {
+  rule: MetricAlertRuleRow;
+  current: ClickHouseWindowRow;
+  previous: ClickHouseWindowRow;
+  pg: PostgresDatabasePool;
+  logger: Logger;
+  onTransition?: OnStateTransition;
+}): Promise<void> {
+  const { rule, current, previous, pg, logger, onTransition } = args;
+  const now = new Date();
+
+  const currentValue = extractMetricValue(current, rule);
+  const previousValue = extractMetricValue(previous, rule);
+  const breached = isThresholdBreached(currentValue, previousValue, rule);
+
+  if (breached) {
+    switch (rule.state) {
+      case 'NORMAL': {
+        await updateState(pg, rule.id, 'PENDING', now);
+        await logTransition(pg, rule, 'NORMAL', 'PENDING', currentValue, previousValue);
+        logger.info({ ruleId: rule.id }, 'Alert rule entered PENDING
state');
+        break;
+      }
+      case 'PENDING': {
+        if (
+          rule.confirmationMinutes === 0 ||
+          hasElapsed(rule.stateChangedAt, rule.confirmationMinutes)
+        ) {
+          await updateState(pg, rule.id, 'FIRING', now, now);
+          await logTransition(pg, rule, 'PENDING', 'FIRING', currentValue, previousValue);
+          await pg.query(psql`
+            INSERT INTO "metric_alert_incidents" (
+              "metric_alert_rule_id", "current_value", "previous_value", "threshold_value"
+            ) VALUES (
+              ${rule.id}, ${currentValue}, ${previousValue}, ${rule.thresholdValue}
+            )
+          `);
+          logger.info({ ruleId: rule.id }, 'Alert rule entered FIRING state');
+          await onTransition?.({
+            rule,
+            fromState: 'PENDING',
+            toState: 'FIRING',
+            currentValue,
+            previousValue,
+          });
+        }
+        break;
+      }
+      case 'FIRING': {
+        break;
+      }
+      case 'RECOVERING': {
+        await updateState(pg, rule.id, 'FIRING', now);
+        await logTransition(pg, rule, 'RECOVERING', 'FIRING', currentValue, previousValue);
+        logger.info({ ruleId: rule.id }, 'Alert rule re-entered FIRING from RECOVERING');
+        break;
+      }
+    }
+  } else {
+    switch (rule.state) {
+      case 'NORMAL': {
+        break;
+      }
+      case 'PENDING': {
+        await updateState(pg, rule.id, 'NORMAL', now);
+        await logTransition(pg, rule, 'PENDING', 'NORMAL', currentValue, previousValue);
+        logger.info(
+          { ruleId: rule.id },
+          'Alert rule returned to NORMAL from PENDING (false alarm)',
+        );
+        break;
+      }
+      case 'FIRING': {
+        await updateState(pg, rule.id, 'RECOVERING', now);
+        await logTransition(pg, rule, 'FIRING', 'RECOVERING', currentValue, previousValue);
+        logger.info({ ruleId: rule.id }, 'Alert rule entered RECOVERING state');
+        break;
+      }
+      case 'RECOVERING': {
+        if (
+          rule.confirmationMinutes === 0 ||
+          hasElapsed(rule.stateChangedAt, rule.confirmationMinutes)
+        ) {
+          await updateState(pg, rule.id, 'NORMAL', now);
+          await logTransition(pg, rule, 'RECOVERING', 'NORMAL', currentValue, previousValue);
+          await pg.query(psql`
+            UPDATE "metric_alert_incidents"
+            SET "resolved_at" = NOW()
+            WHERE "metric_alert_rule_id" =
${rule.id} AND "resolved_at" IS NULL
+          `);
+          logger.info({ ruleId: rule.id }, 'Alert rule resolved, back to NORMAL');
+          await onTransition?.({
+            rule,
+            fromState: 'RECOVERING',
+            toState: 'NORMAL',
+            currentValue,
+            previousValue,
+          });
+        }
+        break;
+      }
+    }
+  }
+
+  await pg.query(psql`
+    UPDATE "metric_alert_rules"
+    SET "last_evaluated_at" = NOW(), "updated_at" = NOW()
+    WHERE "id" = ${rule.id}
+  `);
+}
+
+async function updateState(
+  pg: PostgresDatabasePool,
+  ruleId: string,
+  state: string,
+  stateChangedAt: Date,
+  lastTriggeredAt?: Date,
+) {
+  await pg.query(psql`
+    UPDATE "metric_alert_rules"
+    SET
+      "state" = ${state}
+      , "state_changed_at" = ${stateChangedAt.toISOString()}
+      ${lastTriggeredAt ? psql`, "last_triggered_at" = ${lastTriggeredAt.toISOString()}` : psql``}
+      , "updated_at" = NOW()
+    WHERE "id" = ${ruleId}
+  `);
}

async function logTransition(
  pg: PostgresDatabasePool,
  rule: MetricAlertRuleRow,
  fromState: string,
  toState: string,
  value: number,
  previousValue: number,
) {
  const retentionDays = await getAlertStateLogRetentionDays(pg, rule.organizationId);
  const expiresAt = new Date(Date.now() + retentionDays * 24 * 60 * 60 * 1000);

  await pg.query(psql`
    INSERT INTO "metric_alert_state_log" (
      "metric_alert_rule_id"
      , "target_id"
      , "from_state"
      , "to_state"
      , "value"
      , "previous_value"
      , "threshold_value"
      , "expires_at"
    ) VALUES (
      ${rule.id}
      , ${rule.targetId}
      , ${fromState}
      , ${toState}
      , ${value}
      , ${previousValue}
      , ${rule.thresholdValue}
      , ${expiresAt.toISOString()}
    )
  `);
}
diff --git a/packages/services/workflows/src/lib/metric-alert-notifier.ts b/packages/services/workflows/src/lib/metric-alert-notifier.ts
new file mode 100644
index 00000000000..6fb83c28fc4
--- /dev/null
+++ b/packages/services/workflows/src/lib/metric-alert-notifier.ts
@@ -0,0 +1,274 @@
+import type { Logger } from '@graphql-hive/logger';
+import type { PostgresDatabasePool } from '@hive/postgres';
+import {
psql } from '@hive/postgres';
+import { WebClient } from '@slack/web-api';
+import type { MetricAlertRuleRow } from './metric-alert-evaluator.js';
+import { sendWebhook, type RequestBroker } from './webhooks/send-webhook.js';
+
+type AlertChannelRow = {
+  id: string;
+  type: 'SLACK' | 'WEBHOOK' | 'MSTEAMS_WEBHOOK';
+  name: string;
+  slackChannel: string | null;
+  webhookEndpoint: string | null;
+};
+
+type NotificationEvent = {
+  state: 'firing' | 'resolved';
+  rule: MetricAlertRuleRow;
+  currentValue: number;
+  previousValue: number;
+  organizationSlug: string;
+  projectSlug: string;
+  targetSlug: string;
+};
+
+export async function sendMetricAlertNotifications(args: {
+  ruleId: string;
+  event: NotificationEvent;
+  pg: PostgresDatabasePool;
+  requestBroker: RequestBroker | null;
+  logger: Logger;
+}): Promise<void> {
+  const { ruleId, event, pg, logger } = args;
+
+  // Fetch channels attached to this rule
+  const channels = (await pg.any(psql`
+    SELECT
+      ac."id"
+      , ac."type"
+      , ac."name"
+      , ac."slack_channel" as "slackChannel"
+      , ac."webhook_endpoint" as "webhookEndpoint"
+    FROM "alert_channels" ac
+    INNER JOIN "metric_alert_rule_channels" marc
+      ON marc."alert_channel_id" = ac."id"
+    WHERE marc."metric_alert_rule_id" = ${ruleId}
+  `)) as unknown as AlertChannelRow[];
+
+  if (channels.length === 0) {
+    logger.warn({ ruleId }, 'No channels configured for metric alert rule');
+    return;
+  }
+
+  for (const channel of channels) {
+    try {
+      switch (channel.type) {
+        case 'SLACK': {
+          await sendSlackNotification({ channel, event, pg, logger });
+          break;
+        }
+        case 'WEBHOOK': {
+          await sendWebhookNotification({
+            channel,
+            event,
+            requestBroker: args.requestBroker,
+            logger,
+          });
+          break;
+        }
+        case 'MSTEAMS_WEBHOOK': {
+          await sendTeamsNotification({
+            channel,
+            event,
+            requestBroker: args.requestBroker,
+            logger,
+          });
+          break;
+        }
+      }
+    } catch (error) {
+      logger.error(
+        { error, channelId: channel.id, channelType: channel.type },
+        'Failed to send metric alert
notification',
+      );
+    }
+  }
+}
+
+async function sendSlackNotification(args: {
+  channel: AlertChannelRow;
+  event: NotificationEvent;
+  pg: PostgresDatabasePool;
+  logger: Logger;
+}) {
+  const { channel, event, pg, logger } = args;
+
+  if (!channel.slackChannel) {
+    logger.warn({ channelId: channel.id }, 'Slack channel name not configured');
+    return;
+  }
+
+  // Fetch the org's Slack token
+  const tokenResult = await pg.maybeOneFirst(psql`
+    SELECT "slack_token"
+    FROM "organizations"
+    WHERE "id" = ${event.rule.organizationId}
+  `);
+
+  if (!tokenResult) {
+    logger.warn(
+      { organizationId: event.rule.organizationId },
+      'Slack integration not configured for organization',
+    );
+    return;
+  }
+
+  const token = tokenResult as string;
+  const client = new WebClient(token);
+
+  const isFiring = event.state === 'firing';
+  const emoji = isFiring ? ':rotating_light:' : ':white_check_mark:';
+  const action = isFiring ? 'triggered' : 'resolved';
+  const color = isFiring ? '#E74C3C' : '#2ECC71';
+
+  const changeText = formatChangeText(event);
+
+  await client.chat.postMessage({
+    channel: channel.slackChannel,
+    text: `${emoji} Metric alert ${action}: "${event.rule.name}"`,
+    attachments: [
+      {
+        color,
+        blocks: [
+          {
+            type: 'section',
+            text: {
+              type: 'mrkdwn',
+              text: [
+                `*${event.rule.name}* — ${action}`,
+                `Type: ${event.rule.type} | Severity: ${event.rule.severity}`,
+                changeText,
+                `Target: \`${event.targetSlug}\` in \`${event.projectSlug}\``,
+              ].join('\n'),
+            },
+          },
+        ],
+      },
+    ],
+  });
+
+  logger.debug({ channelId: channel.id }, 'Slack notification sent');
+}
+
+async function sendWebhookNotification(args: {
+  channel: AlertChannelRow;
+  event: NotificationEvent;
+  requestBroker: RequestBroker | null;
+  logger: Logger;
+}) {
+  const { channel, event, logger } = args;
+
+  if (!channel.webhookEndpoint) {
+    logger.warn({ channelId: channel.id }, 'Webhook endpoint not configured');
+    return;
+  }
+
+  const payload = buildWebhookPayload(event);
+
+  await
sendWebhook(logger, args.requestBroker, {
+    attempt: 0,
+    maxAttempts: 5,
+    endpoint: channel.webhookEndpoint,
+    data: payload,
+  });
+
+  logger.debug({ channelId: channel.id }, 'Webhook notification sent');
+}
+
+async function sendTeamsNotification(args: {
+  channel: AlertChannelRow;
+  event: NotificationEvent;
+  requestBroker: RequestBroker | null;
+  logger: Logger;
+}) {
+  const { channel, event, logger } = args;
+
+  if (!channel.webhookEndpoint) {
+    logger.warn({ channelId: channel.id }, 'Teams webhook endpoint not configured');
+    return;
+  }
+
+  const isFiring = event.state === 'firing';
+  const emoji = isFiring ? '🔴' : '✅';
+  const action = isFiring ? 'triggered' : 'resolved';
+  const themeColor = isFiring ? 'E74C3C' : '2ECC71';
+
+  const changeText = formatChangeText(event);
+
+  const card = {
+    '@type': 'MessageCard',
+    '@context': 'http://schema.org/extensions',
+    themeColor,
+    summary: `Metric alert ${action}: "${event.rule.name}"`,
+    sections: [
+      {
+        activityTitle: `${emoji} ${event.rule.name} — ${action}`,
+        facts: [
+          { name: 'Type', value: event.rule.type },
+          { name: 'Severity', value: event.rule.severity },
+          { name: 'Target', value: `${event.targetSlug} in ${event.projectSlug}` },
+        ],
+        text: changeText,
+      },
+    ],
+  };
+
+  await sendWebhook(logger, args.requestBroker, {
+    attempt: 0,
+    maxAttempts: 5,
+    endpoint: channel.webhookEndpoint,
+    data: card,
+  });
+
+  logger.debug({ channelId: channel.id }, 'Teams notification sent');
+}
+
+function formatChangeText(event: NotificationEvent): string {
+  const { rule, currentValue, previousValue } = event;
+  const unit = rule.type === 'LATENCY' ? 'ms' : rule.type === 'ERROR_RATE' ? '%' : ' requests';
+  const metricLabel =
+    rule.type === 'LATENCY'
+      ? `${rule.metric} latency`
+      : rule.type === 'ERROR_RATE'
+        ? 'Error rate'
+        : 'Traffic';
+
+  if (event.state === 'firing') {
+    const changePercent =
+      previousValue !== 0
+        ?
(((currentValue - previousValue) / previousValue) * 100).toFixed(1)
+        : 'N/A';
+    return `${metricLabel}: **${currentValue.toFixed(2)}${unit}** (was ${previousValue.toFixed(2)}${unit}, ${changePercent}% change) — Threshold: ${rule.direction.toLowerCase()} ${rule.thresholdValue}${rule.thresholdType === 'PERCENTAGE_CHANGE' ? '%' : unit}`;
+  }
+
+  return `${metricLabel}: **${currentValue.toFixed(2)}${unit}** (threshold: ${rule.thresholdValue}${rule.thresholdType === 'PERCENTAGE_CHANGE' ? '%' : unit})`;
+}
+
+function buildWebhookPayload(event: NotificationEvent) {
+  const { rule, currentValue, previousValue } = event;
+  const changePercent =
+    previousValue !== 0 ? ((currentValue - previousValue) / previousValue) * 100 : null;
+
+  return {
+    type: 'metric_alert',
+    state: event.state,
+    alert: {
+      name: rule.name,
+      type: rule.type,
+      metric: rule.metric,
+      severity: rule.severity,
+    },
+    currentValue,
+    previousValue,
+    changePercent,
+    threshold: {
+      type: rule.thresholdType,
+      value: rule.thresholdValue,
+      direction: rule.direction,
+    },
+    target: { slug: event.targetSlug },
+    project: { slug: event.projectSlug },
+    organization: { slug: event.organizationSlug },
+  };
+}
diff --git a/packages/services/workflows/src/tasks/evaluate-metric-alert-rules.ts b/packages/services/workflows/src/tasks/evaluate-metric-alert-rules.ts
new file mode 100644
index 00000000000..a3e7c096527
--- /dev/null
+++ b/packages/services/workflows/src/tasks/evaluate-metric-alert-rules.ts
@@ -0,0 +1,117 @@
+import { z } from 'zod';
+import { psql } from '@hive/postgres';
+import { defineTask, implementTask } from '../kit.js';
+import {
+  evaluateRule,
+  fetchEnabledRules,
+  groupRulesByQuery,
+  queryClickHouseWindows,
+} from '../lib/metric-alert-evaluator.js';
+import { sendMetricAlertNotifications } from '../lib/metric-alert-notifier.js';
+
+export const EvaluateMetricAlertRulesTask = defineTask({
+  name: 'evaluateMetricAlertRules',
+  schema: z.unknown(),
+});
+
+export const task =
implementTask(EvaluateMetricAlertRulesTask, async args => {
+  const { context, logger } = args;
+
+  if (!context.clickhouse) {
+    logger.debug('ClickHouse not configured, skipping metric alert evaluation');
+    return;
+  }
+
+  const rules = await fetchEnabledRules(context.pg);
+
+  if (rules.length === 0) {
+    logger.debug('No enabled metric alert rules found');
+    return;
+  }
+
+  logger.info({ count: rules.length }, 'Evaluating metric alert rules');
+
+  const groups = groupRulesByQuery(rules);
+
+  for (const [, groupRules] of groups) {
+    const representative = groupRules[0];
+
+    let windows;
+    try {
+      windows = await queryClickHouseWindows(
+        context.clickhouse,
+        representative.targetId,
+        representative.timeWindowMinutes,
+      );
+    } catch (error) {
+      logger.error(
+        { error, targetId: representative.targetId },
+        'Failed to query ClickHouse for alert evaluation',
+      );
+      continue;
+    }
+
+    if (!windows.current || !windows.previous) {
+      logger.debug(
+        { targetId: representative.targetId },
+        'Insufficient data for evaluation (need both current and previous windows)',
+      );
+      for (const rule of groupRules) {
+        await context.pg.query(psql`
+          UPDATE "metric_alert_rules"
+          SET "last_evaluated_at" = NOW(), "updated_at" = NOW()
+          WHERE "id" = ${rule.id}
+        `);
+      }
+      continue;
+    }
+
+    // Fetch slugs for notification messages (once per group since all share a target)
+    const slugs = (await context.pg.maybeOne(psql`
+      SELECT
+        o."slug" as "organizationSlug"
+        , p."slug" as "projectSlug"
+        , t."slug" as "targetSlug"
+      FROM "targets" t
+      INNER JOIN "projects" p ON p."id" = t."project_id"
+      INNER JOIN "organizations" o ON o."id" = p."org_id"
+      WHERE t."id" = ${representative.targetId}
+    `)) as { organizationSlug: string; projectSlug: string; targetSlug: string } | null;
+
+    for (const rule of groupRules) {
+      await evaluateRule({
+        rule,
+        current: windows.current,
+        previous: windows.previous,
+        pg: context.pg,
+        logger,
+        onTransition: async ({ rule: r, fromState, toState,
currentValue, previousValue }) => { + if (!slugs) return; + + const isFiring = toState === 'FIRING'; + const isResolved = fromState === 'RECOVERING' && toState === 'NORMAL'; + + if (isFiring || isResolved) { + await sendMetricAlertNotifications({ + ruleId: r.id, + event: { + state: isFiring ? 'firing' : 'resolved', + rule: r, + currentValue, + previousValue, + organizationSlug: slugs.organizationSlug, + projectSlug: slugs.projectSlug, + targetSlug: slugs.targetSlug, + }, + pg: context.pg, + requestBroker: context.requestBroker, + logger, + }); + } + }, + }); + } + } + + logger.info('Metric alert evaluation complete'); +}); diff --git a/packages/services/workflows/src/tasks/purge-expired-alert-state-log.ts b/packages/services/workflows/src/tasks/purge-expired-alert-state-log.ts new file mode 100644 index 00000000000..f96f7db8cec --- /dev/null +++ b/packages/services/workflows/src/tasks/purge-expired-alert-state-log.ts @@ -0,0 +1,22 @@ +import { z } from 'zod'; +import { psql } from '@hive/postgres'; +import { defineTask, implementTask } from '../kit.js'; + +export const PurgeExpiredAlertStateLogTask = defineTask({ + name: 'purgeExpiredAlertStateLog', + schema: z.unknown(), +}); + +export const task = implementTask(PurgeExpiredAlertStateLogTask, async args => { + args.logger.debug('purging expired alert state log entries'); + const result = await args.context.pg.oneFirst(psql` + WITH "deleted" AS ( + DELETE FROM "metric_alert_state_log" + WHERE "expires_at" < NOW() + RETURNING 1 + ) + SELECT COUNT(*)::int FROM "deleted"; + `); + const amount = z.number().parse(result); + args.logger.debug({ purgedCount: amount }, 'finished purging expired alert state log entries'); +}); diff --git a/packages/web/app/src/components/layouts/target.tsx b/packages/web/app/src/components/layouts/target.tsx index 91b760102ca..1c0497e8afd 100644 --- a/packages/web/app/src/components/layouts/target.tsx +++ b/packages/web/app/src/components/layouts/target.tsx @@ -44,6 +44,7 @@ export enum 
Page { Laboratory = 'laboratory', Apps = 'apps', Proposals = 'proposals', + Alerts = 'alerts', Settings = 'settings', } @@ -216,6 +217,12 @@ export const TargetLayout = ({ to: '/$organizationSlug/$projectSlug/$targetSlug/proposals', params, }, + { + value: Page.Alerts, + label: 'Alerts', + to: '/$organizationSlug/$projectSlug/$targetSlug/alerts', + params, + }, { value: Page.Settings, label: 'Settings', diff --git a/packages/web/app/src/pages/target-alerts-activity.tsx b/packages/web/app/src/pages/target-alerts-activity.tsx new file mode 100644 index 00000000000..713b3581be1 --- /dev/null +++ b/packages/web/app/src/pages/target-alerts-activity.tsx @@ -0,0 +1,13 @@ +import { SubPageLayout, SubPageLayoutHeader } from '@/components/ui/page-content-layout'; + +export function TargetAlertsActivityPage() { + return ( + + +

+    <SubPageLayout>
+      <SubPageLayoutHeader subPageTitle="Alert activity" />
+      <div>Alert activity coming soon.</div>
+    </SubPageLayout>
+ ); +} diff --git a/packages/web/app/src/pages/target-alerts-create.tsx b/packages/web/app/src/pages/target-alerts-create.tsx new file mode 100644 index 00000000000..49a55859476 --- /dev/null +++ b/packages/web/app/src/pages/target-alerts-create.tsx @@ -0,0 +1,13 @@ +import { SubPageLayout, SubPageLayoutHeader } from '@/components/ui/page-content-layout'; + +export function TargetAlertsCreatePage() { + return ( + + +

+    <SubPageLayout>
+      <SubPageLayoutHeader subPageTitle="Create a new alert" />
+      <div>Create alert form coming soon.</div>
+    </SubPageLayout>
+ ); +} diff --git a/packages/web/app/src/pages/target-alerts-detail.tsx b/packages/web/app/src/pages/target-alerts-detail.tsx new file mode 100644 index 00000000000..e633e7c1aa8 --- /dev/null +++ b/packages/web/app/src/pages/target-alerts-detail.tsx @@ -0,0 +1,18 @@ +import { SubPageLayout, SubPageLayoutHeader } from '@/components/ui/page-content-layout'; + +export function TargetAlertsDetailPage(props: { + organizationSlug: string; + projectSlug: string; + targetSlug: string; + ruleId: string; +}) { + return ( + + +

+    <SubPageLayout>
+      <SubPageLayoutHeader subPageTitle="Alert detail" />
+      <div>Alert detail view coming soon.</div>
+    </SubPageLayout>
+ ); +} diff --git a/packages/web/app/src/pages/target-alerts-rules.tsx b/packages/web/app/src/pages/target-alerts-rules.tsx new file mode 100644 index 00000000000..6ece15e386a --- /dev/null +++ b/packages/web/app/src/pages/target-alerts-rules.tsx @@ -0,0 +1,13 @@ +import { SubPageLayout, SubPageLayoutHeader } from '@/components/ui/page-content-layout'; + +export function TargetAlertsRulesPage() { + return ( + + +

+    <SubPageLayout>
+      <SubPageLayoutHeader subPageTitle="Alert rules" />
+      <div>Alert rules list coming soon.</div>
+    </SubPageLayout>
+  );
+}
diff --git a/packages/web/app/src/pages/target-alerts.tsx b/packages/web/app/src/pages/target-alerts.tsx
new file mode 100644
index 00000000000..9c363dff6c1
--- /dev/null
+++ b/packages/web/app/src/pages/target-alerts.tsx
@@ -0,0 +1,57 @@
+import { Link, Outlet } from '@tanstack/react-router';
+import { Button } from '@/components/ui/button';
+import { Meta } from '@/components/ui/meta';
+import { NavLayout, PageLayout, PageLayoutContent } from '@/components/ui/page-content-layout';
+
+const navItems = [
+  { label: 'Alert activity', segment: 'activity' },
+  { label: 'Alert rules', segment: 'rules' },
+  { label: 'Create a new alert', segment: 'create' },
+] as const;
+
+export function TargetAlertsPage(props: {
+  organizationSlug: string;
+  projectSlug: string;
+  targetSlug: string;
+}) {
+  const params = {
+    organizationSlug: props.organizationSlug,
+    projectSlug: props.projectSlug,
+    targetSlug: props.targetSlug,
+  };
+
+  return (
+    <>
+      <Meta title="Alerts" />
+      <PageLayout>
+        <NavLayout>
+          {navItems.map(item => (
+            <Button key={item.segment} variant="link" asChild>
+              <Link
+                to={`/$organizationSlug/$projectSlug/$targetSlug/alerts/${item.segment}`}
+                params={params}
+              >
+                {item.label}
+              </Link>
+            </Button>
+          ))}
+        </NavLayout>
+        <PageLayoutContent>
+          <Outlet />
+        </PageLayoutContent>
+      </PageLayout>
+    </>
+  );
+}
diff --git a/packages/web/app/src/router.tsx b/packages/web/app/src/router.tsx
index bc230a76527..cd34325e908 100644
--- a/packages/web/app/src/router.tsx
+++ b/packages/web/app/src/router.tsx
@@ -87,6 +87,11 @@ import { TargetLaboratoryPage as TargetLaboratoryPageNew } from './pages/target-
 import { ProposalTab, TargetProposalsSinglePage } from './pages/target-proposal';
 import { TargetProposalsPage } from './pages/target-proposals';
 import { TargetProposalsNewPage } from './pages/target-proposals-new';
+import { TargetAlertsPage } from './pages/target-alerts';
+import { TargetAlertsActivityPage } from './pages/target-alerts-activity';
+import { TargetAlertsCreatePage } from './pages/target-alerts-create';
+import { TargetAlertsDetailPage } from './pages/target-alerts-detail';
+import { TargetAlertsRulesPage } from './pages/target-alerts-rules';
 import { TargetSettingsPage, TargetSettingsPageEnum } from './pages/target-settings';
 import { TargetTracePage } from
'./pages/target-trace';
 import {
@@ -641,6 +646,79 @@ const targetSettingsRoute = createRoute({
   },
 });
 
+// --- Alerts (nested routes with Outlet) ---
+
+const targetAlertsRoute = createRoute({
+  getParentRoute: () => targetRoute,
+  path: 'alerts',
+  component: function TargetAlertsRoute() {
+    const { organizationSlug, projectSlug, targetSlug } = targetAlertsRoute.useParams();
+    return (
+      <TargetAlertsPage
+        organizationSlug={organizationSlug}
+        projectSlug={projectSlug}
+        targetSlug={targetSlug}
+      />
+    );
+  },
+});
+
+const targetAlertsIndexRoute = createRoute({
+  getParentRoute: () => targetAlertsRoute,
+  path: '/',
+  component: function TargetAlertsIndexRoute() {
+    const params = targetAlertsIndexRoute.useParams();
+    return (
+      <Navigate
+        to="/$organizationSlug/$projectSlug/$targetSlug/alerts/activity"
+        params={params}
+        replace
+      />
+    );
+  },
+});
+
+const targetAlertsRulesRoute = createRoute({
+  getParentRoute: () => targetAlertsRoute,
+  path: 'rules',
+  component: TargetAlertsRulesPage,
+});
+
+const targetAlertsActivityRoute = createRoute({
+  getParentRoute: () => targetAlertsRoute,
+  path: 'activity',
+  component: TargetAlertsActivityPage,
+});
+
+const targetAlertsCreateRoute = createRoute({
+  getParentRoute: () => targetAlertsRoute,
+  path: 'create',
+  component: TargetAlertsCreatePage,
+});
+
+const targetAlertsDetailRoute = createRoute({
+  getParentRoute: () => targetAlertsRoute,
+  path: '$ruleId',
+  component: function TargetAlertsDetailRoute() {
+    const { organizationSlug, projectSlug, targetSlug, ruleId } =
+      targetAlertsDetailRoute.useParams();
+    return (
+      <TargetAlertsDetailPage
+        organizationSlug={organizationSlug}
+        projectSlug={projectSlug}
+        targetSlug={targetSlug}
+        ruleId={ruleId}
+      />
+    );
+  },
+});
+
 const targetLaboratoryRoute = createRoute({
   getParentRoute: () => targetRoute,
   path: 'laboratory',
@@ -1178,6 +1256,13 @@ const routeTree = root.addChildren([
   targetAppVersionRoute,
   targetAppsRoute,
   targetProposalsRoute.addChildren([targetProposalsNewRoute, targetProposalsSingleRoute]),
+  targetAlertsRoute.addChildren([
+    targetAlertsIndexRoute,
+    targetAlertsRulesRoute,
+    targetAlertsActivityRoute,
+    targetAlertsCreateRoute,
+    targetAlertsDetailRoute,
+  ]),
 ]),
 ]),
]);
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 90ee0e74149..9b07d34b023 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -2024,6 +2024,9 @@ importers: '@sentry/node': specifier: 7.120.2 version: 7.120.2 + '@slack/web-api': + specifier: 7.10.0 + version: 7.10.0 '@trpc/client': specifier: 10.45.3 version: 10.45.3(@trpc/server@10.45.3) @@ -13133,10 +13136,6 @@ packages: resolution: {integrity: sha512-f0cRzm6dkyVYV3nPoooP8XlccPQukegwhAnpoLcXy+X+A8KfpGOoXwDr9FLZd3wzgLaBGQBE3lY93Zm/i1JvIQ==} engines: {node: '>= 6'} - form-data@4.0.4: - resolution: {integrity: sha512-KrGhL9Q4zjj0kiUt5OO4Mr/A/jlI2jDYs5eHBpYHPcBEVSiipAvn2Ko2HnPe20rmcuuvMHNdZFp+4IlGTMF0Ow==} - engines: {node: '>= 6'} - form-data@4.0.5: resolution: {integrity: sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w==} engines: {node: '>= 6'} @@ -29607,7 +29606,7 @@ snapshots: '@types/retry': 0.12.0 axios: 1.15.0 eventemitter3: 5.0.1 - form-data: 4.0.4 + form-data: 4.0.5 is-electron: 2.2.2 is-stream: 2.0.1 p-queue: 6.6.2 @@ -34826,14 +34825,6 @@ snapshots: hasown: 2.0.2 mime-types: 2.1.35 - form-data@4.0.4: - dependencies: - asynckit: 0.4.0 - combined-stream: 1.0.8 - es-set-tostringtag: 2.1.0 - hasown: 2.0.2 - mime-types: 2.1.35 - form-data@4.0.5: dependencies: asynckit: 0.4.0 diff --git a/scripts/seed-metric-alerts.mts b/scripts/seed-metric-alerts.mts new file mode 100644 index 00000000000..d9687d95156 --- /dev/null +++ b/scripts/seed-metric-alerts.mts @@ -0,0 +1,544 @@ +/** + * Seeds metric alert rules with 30 days of historical data and 7 days of future data. + * + * Creates: alert channels, metric alert rules (all types), incidents, state log entries, + * and sets some rules to non-NORMAL states for testing polling/live updates. 
+ * + * Prerequisites: + * - Docker Compose is running (pnpm local:setup) + * - Services are running (pnpm dev:hive) + * - Run seed:insights first to have an org/project/target with usage data + * + * Usage: + * pnpm seed:metric-alerts + */ + +import setCookie from 'set-cookie-parser'; +import { createPostgresDatabasePool, psql } from '@hive/postgres'; + +process.env.RUN_AGAINST_LOCAL_SERVICES = '1'; +await import('../integration-tests/local-dev.ts'); + +const { ensureEnv } = await import('../integration-tests/testkit/env'); +const { addAlertChannel, addMetricAlertRule } = await import('../integration-tests/testkit/flow'); +const { getServiceHost } = await import('../integration-tests/testkit/utils'); +const { + AlertChannelType, + MetricAlertRuleType, + MetricAlertRuleMetric, + MetricAlertRuleThresholdType, + MetricAlertRuleDirection, + MetricAlertRuleSeverity, +} = await import('../integration-tests/testkit/gql/graphql'); + +// --------------------------------------------------------------------------- +// Config +// --------------------------------------------------------------------------- + +const OWNER_EMAIL = 'alerts-seed@local.dev'; +const PASSWORD = 'ilikebigturtlesandicannotlie47'; +const DAYS_PAST = 30; +const DAYS_AHEAD = 7; + +// --------------------------------------------------------------------------- +// Auth +// --------------------------------------------------------------------------- + +async function signInOrSignUp( + email: string, +): Promise<{ access_token: string; refresh_token: string }> { + const graphqlAddress = await getServiceHost('server', 8082); + + let response = await fetch(`http://${graphqlAddress}/auth-api/signup`, { + method: 'POST', + body: JSON.stringify({ + formFields: [ + { id: 'email', value: email }, + { id: 'password', value: PASSWORD }, + ], + }), + headers: { 'content-type': 'application/json' }, + }); + + let body = await response.json(); + if (body.status === 'OK') { + const cookies = 
setCookie.parse(response.headers.getSetCookie()); + return { + access_token: cookies.find(c => c.name === 'sAccessToken')?.value ?? '', + refresh_token: cookies.find(c => c.name === 'sRefreshToken')?.value ?? '', + }; + } + + response = await fetch(`http://${graphqlAddress}/auth-api/signin`, { + method: 'POST', + body: JSON.stringify({ + formFields: [ + { id: 'email', value: email }, + { id: 'password', value: PASSWORD }, + ], + }), + headers: { 'content-type': 'application/json' }, + }); + + body = await response.json(); + if (body.status === 'OK') { + const cookies = setCookie.parse(response.headers.getSetCookie()); + return { + access_token: cookies.find(c => c.name === 'sAccessToken')?.value ?? '', + refresh_token: cookies.find(c => c.name === 'sRefreshToken')?.value ?? '', + }; + } + + throw new Error('Failed to sign in or up: ' + JSON.stringify(body, null, 2)); +} + +// --------------------------------------------------------------------------- +// DB helpers +// --------------------------------------------------------------------------- + +function getPGConnectionString() { + const pg = { + user: ensureEnv('POSTGRES_USER'), + password: ensureEnv('POSTGRES_PASSWORD'), + host: ensureEnv('POSTGRES_HOST'), + port: ensureEnv('POSTGRES_PORT'), + db: ensureEnv('POSTGRES_DB'), + }; + return `postgres://${pg.user}:${pg.password}@${pg.host}:${pg.port}/${pg.db}?sslmode=disable`; +} + +function hoursAgo(hours: number): Date { + return new Date(Date.now() - hours * 60 * 60 * 1000); +} + +function hoursAhead(hours: number): Date { + return new Date(Date.now() + hours * 60 * 60 * 1000); +} + +function randomBetween(min: number, max: number): number { + return Math.random() * (max - min) + min; +} + +// --------------------------------------------------------------------------- +// Main +// --------------------------------------------------------------------------- + +async function main() { + console.log('🚨 Seeding metric alert rules...\n'); + + // 1. 
Find the first org/project/target that has usage data + const pool = await createPostgresDatabasePool({ + connectionParameters: getPGConnectionString(), + }); + + const existingTarget = await pool.maybeOne(psql` + SELECT + t."id" as "targetId", + t."clean_id" as "targetSlug", + p."id" as "projectId", + p."clean_id" as "projectSlug", + o."id" as "organizationId", + o."clean_id" as "organizationSlug" + FROM "targets" t + INNER JOIN "projects" p ON p."id" = t."project_id" + INNER JOIN "organizations" o ON o."id" = p."org_id" + ORDER BY t."created_at" DESC + LIMIT 1 + `); + + if (!existingTarget) { + console.error('❌ No targets found. Run seed:insights first.'); + process.exit(1); + } + + const { targetId, targetSlug, projectId, projectSlug, organizationId, organizationSlug } = + existingTarget as { + targetId: string; + targetSlug: string; + projectId: string; + projectSlug: string; + organizationId: string; + organizationSlug: string; + }; + + console.log(`📍 Using target: ${organizationSlug}/${projectSlug}/${targetSlug}`); + + // 2. Auth + const auth = await signInOrSignUp(OWNER_EMAIL); + const token = auth.access_token; + console.log(`🔑 Authenticated as ${OWNER_EMAIL}`); + + // 3. 
Create alert channels + console.log('\n📡 Creating alert channels...'); + + const slackChannel = await addAlertChannel( + { + organizationSlug: organizationSlug as string, + projectSlug: projectSlug as string, + name: 'Slack #alerts', + type: AlertChannelType.Slack, + slack: { channel: '#alerts' }, + }, + token, + ).then(r => r.expectNoGraphQLErrors()); + + const webhookChannel = await addAlertChannel( + { + organizationSlug: organizationSlug as string, + projectSlug: projectSlug as string, + name: 'PagerDuty Webhook', + type: AlertChannelType.Webhook, + webhook: { endpoint: 'https://events.pagerduty.com/v2/enqueue' }, + }, + token, + ).then(r => r.expectNoGraphQLErrors()); + + const slackChannelId = slackChannel.addAlertChannel.ok!.addedAlertChannel.id; + const webhookChannelId = webhookChannel.addAlertChannel.ok!.addedAlertChannel.id; + console.log(` Created: Slack #alerts (${slackChannelId})`); + console.log(` Created: PagerDuty Webhook (${webhookChannelId})`); + + // 4. Create metric alert rules + console.log('\n📏 Creating metric alert rules...'); + + const ruleDefs = [ + { + name: 'Error Rate Above 10% - Last 5 Min', + type: MetricAlertRuleType.ErrorRate, + metric: null, + timeWindowMinutes: 5, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 10, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Critical, + channelIds: [slackChannelId, webhookChannelId], + // Will be set to FIRING + desiredState: 'FIRING' as const, + }, + { + name: 'Error Rate Above 5%', + type: MetricAlertRuleType.ErrorRate, + metric: null, + timeWindowMinutes: 60, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 5, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Info, + channelIds: [slackChannelId, webhookChannelId], + desiredState: 'FIRING' as const, + }, + { + name: 'Error Rate Increased by 75% - Last Hour', + type: MetricAlertRuleType.ErrorRate, + metric: null, + 
timeWindowMinutes: 60, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 75, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Critical, + channelIds: [slackChannelId], + desiredState: 'NORMAL' as const, + }, + { + name: 'P99 Latency Above 2000ms', + type: MetricAlertRuleType.Latency, + metric: MetricAlertRuleMetric.P99, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 2000, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [slackChannelId, webhookChannelId], + desiredState: 'FIRING' as const, + }, + { + name: 'P95 Latency Increased by 25%', + type: MetricAlertRuleType.Latency, + metric: MetricAlertRuleMetric.P95, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 25, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [webhookChannelId], + desiredState: 'NORMAL' as const, + }, + { + name: 'P99 Latency Increased by 50%', + type: MetricAlertRuleType.Latency, + metric: MetricAlertRuleMetric.P99, + timeWindowMinutes: 60, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 50, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Critical, + channelIds: [slackChannelId], + desiredState: 'NORMAL' as const, + }, + { + name: 'Request Rate Below 100 rpm', + type: MetricAlertRuleType.Traffic, + metric: null, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 100, + direction: MetricAlertRuleDirection.Below, + severity: MetricAlertRuleSeverity.Critical, + channelIds: [slackChannelId, webhookChannelId], + desiredState: 'NORMAL' as const, + }, + { + name: 'Traffic Decreased by 60% - Last Hour', + type: MetricAlertRuleType.Traffic, + metric: null, + timeWindowMinutes: 60, + thresholdType: 
MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 60, + direction: MetricAlertRuleDirection.Below, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [webhookChannelId], + desiredState: 'NORMAL' as const, + }, + { + name: 'Traffic Increased by 150% - Last 30 Min', + type: MetricAlertRuleType.Traffic, + metric: null, + timeWindowMinutes: 30, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 150, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Info, + channelIds: [slackChannelId, webhookChannelId], + desiredState: 'FIRING' as const, + }, + { + name: 'Traffic Increased by 200% - Last 15 Min', + type: MetricAlertRuleType.Traffic, + metric: null, + timeWindowMinutes: 15, + thresholdType: MetricAlertRuleThresholdType.PercentageChange, + thresholdValue: 200, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [webhookChannelId], + desiredState: 'PENDING' as const, + }, + { + name: 'Request Rate Above 10,000 rpm', + type: MetricAlertRuleType.Traffic, + metric: null, + timeWindowMinutes: 5, + thresholdType: MetricAlertRuleThresholdType.FixedValue, + thresholdValue: 10000, + direction: MetricAlertRuleDirection.Above, + severity: MetricAlertRuleSeverity.Warning, + channelIds: [slackChannelId], + desiredState: 'RECOVERING' as const, + }, + ]; + + const createdRules: Array<{ id: string; name: string; desiredState: string }> = []; + + for (const def of ruleDefs) { + const result = await addMetricAlertRule( + { + organizationSlug: organizationSlug as string, + projectSlug: projectSlug as string, + targetSlug: targetSlug as string, + name: def.name, + type: def.type, + metric: def.metric ?? 
undefined, + timeWindowMinutes: def.timeWindowMinutes, + thresholdType: def.thresholdType, + thresholdValue: def.thresholdValue, + direction: def.direction, + severity: def.severity, + channelIds: def.channelIds, + }, + token, + ).then(r => r.expectNoGraphQLErrors()); + + const rule = result.addMetricAlertRule.ok?.addedMetricAlertRule; + if (rule) { + createdRules.push({ id: rule.id, name: def.name, desiredState: def.desiredState }); + console.log(` ✓ ${def.name} (${rule.id})`); + } else { + console.error(` ✗ Failed to create: ${def.name}`); + } + } + + // 5. Seed state transitions, incidents, and set desired states + console.log('\n📊 Seeding 30 days of historical data + 7 days ahead...'); + + const totalHours = (DAYS_PAST + DAYS_AHEAD) * 24; + const nowHoursFromStart = DAYS_PAST * 24; + + for (const rule of createdRules) { + // Generate realistic state transition history + const transitions: Array<{ + fromState: string; + toState: string; + value: number; + previousValue: number; + thresholdValue: number; + createdAt: Date; + }> = []; + + let currentState = 'NORMAL'; + let hour = 0; + + while (hour < totalHours) { + // Random interval between events (4-48 hours) + hour += Math.floor(randomBetween(4, 48)); + if (hour >= totalHours) break; + + const eventTime = hoursAgo(nowHoursFromStart - hour); + const value = randomBetween(50, 500); + const previousValue = randomBetween(30, 300); + + // Simulate a firing cycle: NORMAL → PENDING → FIRING → RECOVERING → NORMAL + if (currentState === 'NORMAL') { + transitions.push({ + fromState: 'NORMAL', + toState: 'PENDING', + value, + previousValue, + thresholdValue: value * 0.8, + createdAt: eventTime, + }); + currentState = 'PENDING'; + + // PENDING → FIRING after a few minutes + const firingTime = new Date(eventTime.getTime() + randomBetween(2, 10) * 60000); + transitions.push({ + fromState: 'PENDING', + toState: 'FIRING', + value: value * 1.1, + previousValue: value, + thresholdValue: value * 0.8, + createdAt: firingTime, + 
}); + currentState = 'FIRING'; + + // FIRING → RECOVERING after some time + const recoverTime = new Date( + firingTime.getTime() + randomBetween(10, 180) * 60000, + ); + transitions.push({ + fromState: 'FIRING', + toState: 'RECOVERING', + value: value * 0.5, + previousValue: value * 1.1, + thresholdValue: value * 0.8, + createdAt: recoverTime, + }); + currentState = 'RECOVERING'; + + // RECOVERING → NORMAL after confirmation + const normalTime = new Date( + recoverTime.getTime() + randomBetween(3, 15) * 60000, + ); + transitions.push({ + fromState: 'RECOVERING', + toState: 'NORMAL', + value: value * 0.3, + previousValue: value * 0.5, + thresholdValue: value * 0.8, + createdAt: normalTime, + }); + currentState = 'NORMAL'; + } + } + + // Insert state log entries + const expiresAt = hoursAhead(DAYS_AHEAD * 24 + 24); + for (const t of transitions) { + await pool.query(psql` + INSERT INTO "metric_alert_state_log" ( + "metric_alert_rule_id", "target_id", "from_state", "to_state", + "value", "previous_value", "threshold_value", "created_at", "expires_at" + ) VALUES ( + ${rule.id}, ${targetId as string}, ${t.fromState}, ${t.toState}, + ${t.value}, ${t.previousValue}, ${t.thresholdValue}, + ${t.createdAt.toISOString()}, ${expiresAt.toISOString()} + ) + `); + } + + // Create incidents from FIRING transitions + const firingTransitions = transitions.filter(t => t.toState === 'FIRING'); + const resolvedTransitions = transitions.filter(t => t.fromState === 'RECOVERING' && t.toState === 'NORMAL'); + + for (let i = 0; i < firingTransitions.length; i++) { + const firing = firingTransitions[i]; + const resolved = resolvedTransitions[i]; + await pool.query(psql` + INSERT INTO "metric_alert_incidents" ( + "metric_alert_rule_id", "started_at", "resolved_at", + "current_value", "previous_value", "threshold_value" + ) VALUES ( + ${rule.id}, ${firing.createdAt.toISOString()}, + ${resolved ? 
resolved.createdAt.toISOString() : null}, + ${firing.value}, ${firing.previousValue}, ${firing.thresholdValue} + ) + `); + } + + // Set desired state for live testing + if (rule.desiredState !== 'NORMAL') { + const stateChangedAt = hoursAgo(randomBetween(0.5, 3)); + await pool.query(psql` + UPDATE "metric_alert_rules" + SET + "state" = ${rule.desiredState}, + "state_changed_at" = ${stateChangedAt.toISOString()}, + "last_evaluated_at" = NOW(), + "updated_at" = NOW() + WHERE "id" = ${rule.id} + `); + + // For FIRING rules, create an open incident + if (rule.desiredState === 'FIRING') { + await pool.query(psql` + INSERT INTO "metric_alert_incidents" ( + "metric_alert_rule_id", "started_at", + "current_value", "previous_value", "threshold_value" + ) VALUES ( + ${rule.id}, ${stateChangedAt.toISOString()}, + ${randomBetween(100, 500)}, ${randomBetween(30, 100)}, ${randomBetween(50, 200)} + ) + `); + } + + console.log(` 🔥 ${rule.name} → ${rule.desiredState}`); + } + + console.log( + ` 📈 ${rule.name}: ${transitions.length} state transitions, ${firingTransitions.length} incidents`, + ); + } + + await pool.end(); + + console.log(` +✅ Metric alerts seed complete! + +Credentials: + Email: ${OWNER_EMAIL} + Password: ${PASSWORD} + +Rules in non-NORMAL states (for polling testing): +${createdRules + .filter(r => r.desiredState !== 'NORMAL') + .map(r => ` ${r.desiredState.padEnd(12)} ${r.name}`) + .join('\n')} + +Navigate to: + http://localhost:3000/${organizationSlug}/${projectSlug}/${targetSlug}/alerts +`); +} + +main().catch(err => { + console.error('Seed failed:', err); + process.exit(1); +});