chore(hooks): render dev-test plugin via the prod generator#2884
chore(hooks): render dev-test plugin via the prod generator#2884mfbx9da4 wants to merge 27 commits into
Conversation
Every scanner now emits kebab-case rule ids and human-readable descriptions that never echo the matched value. Adds a central rule catalog, plumbs RuleContext through every Finding writer, and exposes a structured DenyReason from the shadowmcp validator. Frontend resolves unknown kebab rule ids by title-casing them so new backend rules render legibly without a dashboard release.
The detection mechanism (missing toolset id, unknown toolset, ...) is implementation detail, not the risk itself. The actionable risk is the same in every case: an unverified MCP tool call. Revert the validator DenyReason addition and the four-way rule split; emit a single 'shadow-mcp' rule_id whose description still names the tool.
Group rule ids by risk category via a dotted prefix so consumers can bucket findings without switching on `source`: secret.<rule> gitleaks pii.<rule> presidio shadow-mcp shadow_mcp (unchanged) destructive.tool destructive_tool destructive.cli-<cmd> cli_destructive pi ML classifier prompt injection pi.<heuristic> L0 heuristic prompt injection Collapses the two role-hijack subrules into a single `pi.role-hijack` and drops the `deberta-v3-classifier` model name from the rule id (model is implementation detail). cli_destructive drops the implicit `shell`/`cloud` category prefixes; `git`/`database` stay because they convey real grouping. Adds CanonicalGitleaksRuleID, CanonicalPresidioRuleID, and CanonicalCLIDestructiveRuleID helpers; the frontend canonicalizer mirrors them.
Drop the cliDestructiveCanonical indirection table in rules.go. Update cliDestructivePattern.FullName() to emit `destructive.<category>.<name>` directly (e.g. `destructive.shell.rm-rf`, `destructive.git.push-force`) so the writer can use it verbatim as the canonical rule id. The shell/git/database/cloud groupings now form a real second layer under `destructive.`, peers of `destructive.tool`.
The standalone L1 classifier rule was `pi`; heuristic sub-rules used `pi.<heuristic>`. Rename the whole family to `prompt-injection` and `prompt-injection.<heuristic>` so the prefix is self-describing and matches the source name. Consumers pattern-matching on the prefix get both the L1 verdict and the L0 heuristics under one bucket.
DETECTION_RULES now stores rule ids in the canonical form findings carry (secret./pii./prompt-injection.unknown). PolicyCenter.tsx translates canonical -> Presidio's UPPER_SNAKE wire format at the payload boundary via the new ruleIdToPresidioEntity helper. prompt_injection_rules ships the canonical rule_id verbatim; the backend's L1 opt-in key is the same string as the resulting finding rule_id. The deberta classifier verdict lands at prompt-injection.unknown — the binary model can't identify a specific attack family, so it sits alongside the L0 prompt-injection.* heuristics rather than at the standalone prompt-injection prefix.
Add ValidateRuleID + a kebab-with-optional-dots grammar regex. Normalize panics on a malformed canonical id when the binary is a test binary (testing.Testing()) or after cmd/gram wires EnableRuleIDFormatEnforcement() on for the local environment. Production runs leave the value through, so a future writer drift doesn't take down scanning. Catches writer regressions at the boundary instead of letting them contaminate risk_results, which keeps the public rule_id contract honest.
…eature flag Drop all sub-rules under the prompt-injection prefix; every prompt-injection finding now carries rule_id = "prompt-injection" regardless of whether it was produced by the L1 deberta classifier or an L0 heuristic regex/keyword pattern. The detection engine is an implementation detail, not part of the public contract. Engine selection moves from per-policy opt-in to per-org feature flag: FlagPromptInjectionUseRegex (default off) → L1 classifier (deberta) FlagPromptInjectionUseRegex on → L0 heuristics (regex) StubClassifier (no --pi-classifier-url) falls back to L0 unconditionally so local-dev still produces findings. ScanBatch/Scan now take orgID instead of a rules slice; the risk_policies.prompt_injection_rules field becomes vestigial (left on schema for backward compatibility; future PR can drop it). Frontend: DETECTION_RULES.prompt_injection is empty — the prompt-injection category is a single toggle. PolicyCenter no longer emits per-rule selection; categoriesToPayload signature is simplified.
…t-in Rename FlagPromptInjectionUseRegex -> FlagPromptInjectionUseClassifier and invert the default. Out of the box every org gets the L0 regex engine; an org opts in to the deberta classifier by flipping the feature flag. Reason: the classifier requires a sidecar deployment that not every environment runs. Defaulting to the always-available regex keeps the out-of-the-box experience working and turns the classifier into a deliberate uplift rather than an implicit dependency.
guardRuleIDs sweeps every InsertRiskResultsParams before the INSERT and runs ValidateRuleID on its rule_id. In dev/test mode an offending row panics so writer drift fails CI immediately; in production the row is dropped (and logged with risk.source / risk.rule_id) so a single misbehaving scanner can't break the whole batch. This is the second layer of defense, stacked behind the existing Normalize check at the canonical-id construction site. A DB CHECK constraint will follow in the migration PR for the third (always-on) layer.
… custom message on block PolicyCenter: - Drop the standalone deberta-v3-classifier rule from the prompt_injection category UI. The engine (regex vs deberta) is an org-level concern flipped via the prompt-injection-use-classifier feature flag, not something the policy author picks. The category becomes a single on/off checkbox. - Drop formPromptInjectionRules / setFormPromptInjectionRules state, the useRiskCapabilities + piClassifierEnabled wiring, and the special-case rendering branch that allowed per-rule selection under prompt_injection. - Render the Custom Message field only when action='block'. Flag-action policies record findings silently and never surface a user-facing reason, so the field is irrelevant there.
canonicalizeRuleId was a stop-gap from when DETECTION_RULES.id stored wire-format ids (UPPER_SNAKE Presidio entities, raw gitleaks names). After the earlier rewrite, DETECTION_RULES.id is already canonical end to end, so every call site was a no-op. Remove the function and the import; SecurityOverview's lookup maps now key on rule.id directly. ruleIdToPresidioEntity and humanizeRuleId stay — they handle the only remaining transforms (canonical -> Presidio wire format, and unknown-id display fallback). Also drop the prompt_attacks category from the policy form: it was a placeholder that did not map to a real backend source, and its entries were never actionable.
Unknown findings fall through to humanizeRuleId for their display label. Reading 'Pii Credit Card' or 'Destructive Cli Rm Rf' is jarring; this adds a small acronym set (pii, cli, mcp, api, ssn, nhs, iban, ...) that gets upper-cased instead of title-cased. Examples: pii.credit-card -> 'PII Credit Card' pii.us-ssn -> 'PII US SSN' pii.uk-nhs -> 'PII UK NHS' destructive.cli-rm-rf -> 'Destructive CLI Rm Rf'
…cy ids Risk_results rows written before this PR still carry legacy formats (MEDICAL_LICENSE, shadow_mcp.unverified_call, cli_destructive.shell/rm-rf, pi.role-hijack.you-are-now, ...). Until the data migration ships, the dashboard renders them via humanizeRuleId. Splitting on every separator (., -, _, /) means legacy ids read legibly during the overlap window instead of as Frankenstein 'Cli_destructive Shell/rm Rf' strings.
…ibe builders RuleContext was a bag of optional fields (ToolName, MatchedPattern) forwarded through a string-keyed catalog to a single Normalize entry point. Every call site passed exhaustruct-zero values for the fields it didn't care about, and adding a new scanner with new context would force the shared struct to grow — open/closed violation. Replace with one typed function per emit shape: DescribeShadowMCP(toolName) -> (ruleID, description) DescribeDestructiveTool(toolName) -> ... DescribeCLIDestructive(pattern, toolName)-> ... DescribePresidioEntity(rawEntityType) -> ... DescribePresidioDeadLetter() -> ... DescribeGitleaks(raw, upstreamDesc) -> ... DescribePromptInjection() -> ... Each builder accepts exactly the inputs it needs. A new scanner adds a new function instead of extending a shared type. guard() (kept in rules.go) still validates every canonical id against the grammar in dev/test, so the rule_id format guarantee is unchanged. ValidateRuleID, guardRuleIDs in writeResults, and the EnableRuleIDFormatEnforcement opt-in all stay.
…ult-rule-ids # Conflicts: # .speakeasy/workflow.lock # client/sdk/.speakeasy/gen.lock # server/gen/http/openapi3.json
… dev/test-only
Two follow-ups on the merge:
1. Reverted the verbose risk_results attribute descriptions I added to
server/design/shared/risk.go. They were internal commentary ("never
leaks internal validator detail", "do not surface unredacted in
public contracts") that should not appear in the public OpenAPI /
SDK docs. Restored the original terse customer-facing strings; Goa
+ TypeScript SDK regenerated.
2. Fixed a production correctness bug in AnalyzeBatch.guardRuleIDs:
the previous version dropped any row with a malformed rule_id in
prod. Because writeResults first DELETEs all rows for the message
batch and the unanalyzed-message query treats 'no risk_results row'
as the signal to re-scan, a malformed rule_id caused the same
message to be re-scanned every batch. guardRuleIDs is now dev/test-
only: in those modes a malformed id panics; in production the
function is a no-op pass-through, so analyzed-state semantics are
preserved.
The previous refactor turned the engines into a mutually-exclusive either/or — flag off → L0 only, flag on → L1 only (with L0 as a fallback on classifier error). That was a regression: pre-refactor behavior was L0 always runs, L1 is appended when opted in. The only thing the feature flag change was supposed to do was move the L1 opt-in from policy config to an org-level feature flag. Restore the additive model: - PromptInjectionScanner.Scan / ScanBatch always run runHeuristics(text). - When FlagPromptInjectionUseClassifier is on for the org AND the classifier is non-stub, also run the classifier and append any L1 finding to the L0 set. - Classifier errors leave L0 findings intact (no swap to "fallback"). - Stub classifier is treated as "no L1 deployed" regardless of flag. Rename the gate to classifierEnabled() to match the new semantics. Tests rewritten to cover L0-only, L0+L1 (both fire), L0+L1 (only L1 fires on benign text), SAFE label, classifier error, and stub classifier paths.
rules.go keeps the central rule_id surface: grammar guard, canonical
rule_id constants, and the per-source canonicalizers
(CanonicalGitleaksRuleID, CanonicalPresidioRuleID). The description
sentences move next to the scanner that owns them:
- gitleaks.go → DescribeGitleaks
- presidio.go → DescribePresidioEntity, DescribePresidioDeadLetter,
presidioEntityDescriptions
- pi_scanner.go → DescribePromptInjection
- cli_destructive.go → DescribeCLIDestructive, cliCommandHumanForm
- analyze_batch.go → DescribeShadowMCP, DescribeDestructiveTool
(their writers live in this file too)
Net: rule_id contract surface stays in one place; presentation logic
lives with the source that produces it.
The RiskBadgePopover badge was rendering the source (`presidio`, `gitleaks`, ...) which leaks the implementation detail policy authors never think in. Switch to the category prefix of the rule id: pii.email-address → PII secret.anthropic-api-key → SECRET destructive.shell.rm-rf → DESTRUCTIVE shadow-mcp → SHADOW-MCP prompt-injection → PROMPT-INJECTION Adds a small ruleIdCategoryLabel(ruleId) helper in rule-ids.ts that takes the first dot-delimited segment (or the whole id when undotted) and uppercases it. Falls back to the uppercased source if no rule_id.
Presidio aggressively detects `::` (and the all-zero variants `::0`, `0::0`, `0:0:0:0:0:0:0:0`) as an IP_ADDRESS. It appears in code, networking configs, and stack traces but is never a meaningful PII finding, so it just noises up the security overview. Add a narrow isPresidioFalsePositive filter applied in convertPresidioFindings that drops these matches before they become a Finding. Real addresses (127.0.0.1, 2001:db8::1, ::1 loopback) still flow through.
This change adds a new `AppendBatch` function that can bulk insert outbox entries within a transaction.
The local hooks:test harness used a hand-maintained send_hook.sh stub under hooks/plugin-claude-test/ that had drifted from the templated hook.sh that server/internal/plugins.GenerateObservabilityPluginPackage emits for real publishes. The stub had no Gram-Key/Gram-Project headers, so even with --local the hook arrived unauthenticated and PreToolUse enforcement silently no-op'd before OTEL session metadata landed in Redis. Add a tiny server/cmd/dev-observability-plugin entrypoint that wraps GenerateObservabilityPluginPackage, and have test.sh provision a hooks-scoped api key, render the plugin into hooks/.local/plugin-claude/ on the fly, then exec claude --plugin-dir against it. The test path now exercises the exact bytes prod ships, just with a locally-issued key wired through. Delete the stale stub dir, ignore the generated output dir, and keep --local as a deprecated no-op so existing muscle memory still works.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Stacked on #2855.
Why
mise run hooks:testshipped a hand-maintainedsend_hook.shstub underhooks/plugin-claude-test/that had drifted from the templatedhook.shthatserver/internal/plugins.GenerateObservabilityPluginPackageemits for real publishes. Two concrete consequences:Gram-Key/Gram-Projectheaders, so even with--localthe hook arrived unauthenticated. With OTEL session metadata not yet in Redis on the first prompt,resolveClaudeScanProjectIDreturnedok=falseandPreToolUseenforcement silently no-op'd, so the local harness couldn't actually exercise blocking policies on the first tool call of a session.The default (non-
--local) mode was also broken in practice: it picks up the published plugin from~/.claude/plugins/cache/..., which bakes agram_live_key tied to whatever environment first published the plugin. Locally, that key 401s.What changed
Before: two parallel implementations of
send_hook.sh. One in the repo (stub, drifting), one generated by the server (canonical). Tests used the stub. Prod used the generator.After: one implementation.
test.shprovisions ahooks-scoped api key in the local DB, then calls a newserver/cmd/dev-observability-pluginhelper that wrapsplugins.GenerateObservabilityPluginPackageto render the plugin intohooks/.local/plugin-claude/. Claude is launched with--plugin-diragainst that directory. The bytes Claude executes are now identical to what prod ships, just with a locally-issued key.Other changes:
test.sh: missing project / missing users are errors instead of warnings (the rest of the script was already assuming the key got provisioned).hooks/plugin-claude-test/(the stale stub).--localas a deprecated no-op so existing muscle memory still works.hooks/.local/since it's regenerated on every run.hooks/plugin-claude/and.mise-tasks/hooks/zip.share intentionally left alone. The zip output isn't referenced anywhere in the server or build pipeline, so that's a separate cleanup.Test plan
mise run hooks:test --project defaultprovisions a key, renders the plugin, launches Claudehooks/.local/plugin-claude/hooks/hook.shand confirm it has the locally-issuedgram_local_...key +Gram-Project: defaultheadersUserPromptSubmitwith a string matching a blocking risk policy and confirm the hook returns 403 / the prompt is blockedmise run hooks:test --project default --localstill works (no-op flag accepted)