feat(oncall-agent): read-only Slack-triggered cluster triage agent by manan164 · Pull Request #293 · agentspan-ai/agentspan

manan164 · 2026-06-29T16:06:18Z

feat(oncall-agent): read-only Slack-triggered cluster triage agent

Draft. The scaffold, the read-only investigation loop, and a per-alert-type triage
playbook for all 17 health-check alert types now work. What remains is a live
end-to-end run against ah5r-prod and a hypothesis-quality eval.

What this is

An Agentspan agent that triages Orkes SaaS cluster health-check alerts. It polls the Slack
alert channel (Web API, no Socket Mode), parses the executionId from the failing
health_check execution URL, runs read-only agent-handler commands against the ah5r-prod
Conductor API to investigate, and replies in-thread with a root-cause hypothesis.

It is strictly read-only and advisory — it never takes a remediating action. SQL is gated by
a deterministic SELECT-only guard (sql_guard.py), not by trusting the model. The agent reads
organizationId / clusterName / cloudEnvironmentTag off the failing execution, so the LLM
only threads the executionId into each tool — it cannot target the wrong cluster and no secrets
pass through tool args.
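
The execution-id step above can be sketched as follows. This is a minimal illustration, not the code in alert.py: it assumes the alert text carries a URL whose path contains `/execution/<uuid>`, and the function name `extract_execution_id` and the example host are hypothetical.

```python
import re
from typing import Optional

# Assumption: the failing health_check execution URL has a path segment
# "/execution/<uuid>". The real URL shape may differ.
_EXECUTION_URL_RE = re.compile(
    r"https?://\S*?/execution/"
    r"([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})"
)


def extract_execution_id(alert_text: str) -> Optional[str]:
    """Return the execution id found in the alert text, or None if absent."""
    match = _EXECUTION_URL_RE.search(alert_text)
    return match.group(1).lower() if match else None
```

Returning None rather than raising lets a caller skip channel messages that are not alerts without special-casing exceptions.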

Done ✅

- Slack ingestion: Web API poller (conversations.history + chat.postMessage), bot token only, run-once / --loop, state-file dedup. Matches sdk/python/examples/91_slack_autofix_agent.py.
- Alert parsing (alert.py): extracts execution id + severity + org/cluster from the alert text.
- Conductor dispatch (conductor_client.py): starts agent-handler command workflows on ah5r-prod via conductor-python (app key/secret), polls to completion, derives + caches cluster context.
- Read-only tool set (tools.py): get_incident_details, get_cluster_metrics, get_infrastructure_metrics, get_pods_data, get_deployments_info, get_pod_events, get_top_output, pull_pod_logs, get_ingress_info, run_sql_select (SELECT-guarded).
- SQL safety guard (sql_guard.py): SELECT/WITH/EXPLAIN/SHOW only; rejects every mutation, multi-statement, and comment-smuggling case before it reaches the DB.
- Per-alert-type triage playbook for all 17 health-check alert types (agent.py): symptom → evidence-to-gather → what-to-cite, matched off the issues text. Keeps the strong Redis / decider-queue / CPU / heap guidance and extends to component-down, pod, networking, and self-describing alerts (table below).
- Tests (deterministic, no LLM in the assertion path, per CLAUDE.md): test_sql_guard.py, test_alert.py, test_poller.py, test_conductor_client.py, test_tools_readonly.py (read-only safety guard, pinned to the real AgentHandlerCommand enum names), test_tools_dispatch.py (each tool dispatches its expected agent-handler command). 35 passing.
- Local run path (python -m oncall_agent.main [triage <execId>]), .env.example, README.
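
A deterministic SELECT-only guard of the kind described above might look like the sketch below. It is not the contents of sql_guard.py: the function name `is_read_only_sql` and the keyword list are assumptions, and the check is deliberately conservative (it also rejects harmless queries that merely contain a semicolon or a mutating keyword inside a string literal).

```python
import re

# Statements may only start with one of these keywords.
_ALLOWED_LEADING_KEYWORDS = {"select", "with", "explain", "show"}

# Assumed deny list: any of these words anywhere in the text rejects the query.
# This also catches data-modifying CTEs and "EXPLAIN ANALYZE <mutation>".
_MUTATING_KEYWORDS_RE = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|"
    r"grant|revoke|copy|call|into)\b",
    re.IGNORECASE,
)

# Comments are rejected outright so nothing can be smuggled inside them.
_COMMENT_MARKERS = ("--", "/*", "*/")


def is_read_only_sql(sql: str) -> bool:
    """Return True only for a single, comment-free, read-only statement."""
    text = sql.strip()
    if text.endswith(";"):  # tolerate one trailing semicolon
        text = text[:-1].rstrip()
    if not text:
        return False
    if ";" in text:  # any remaining semicolon means multiple statements
        return False
    if any(marker in text for marker in _COMMENT_MARKERS):
        return False
    leading = text.split(None, 1)[0].lower()
    if leading not in _ALLOWED_LEADING_KEYWORDS:
        return False
    return _MUTATING_KEYWORDS_RE.search(text) is None
```

Failing closed on ambiguous input is the design choice here: a false rejection costs the agent one query, while a false acceptance could reach the database.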

To do 🚧

- End-to-end run against ah5r-prod: needs Conductor app key/secret, Slack app tokens, and the Agentspan server running with ANTHROPIC_API_KEY. Not yet exercised live.
- Hypothesis-quality eval: run dry-run on real alerts and score the root-cause hypotheses before trusting it. Tune the playbook from real misses.
- Remediation: deliberately out of scope for v1; when added it must go behind the Agentspan HITL approval gate.

The 17 health-check alert types — all now have a playbook

Source of truth: HealthIssue enum in
orkes-saas/.../worker/HealthCheckIssuesWorker.java. "Approach" = how agent.py triages it.

| # | Alert type | Sev | Triage approach |
|---|---|---|---|
| 1 | `REDIS_CRITICAL_USAGE` | CRITICAL | decider-queue backlog → server/worker logs |
| 2 | `REDIS_HIGH_USAGE` | MAJOR | decider-queue backlog → server/worker logs |
| 3 | `CONDUCTOR_HIGH_HEAP_USAGE` | MAJOR | top + infra metrics → logs grep OutOfMemory/GC |
| 4 | `CONDUCTOR_HIGH_CPU_USAGE` | MAJOR | top + infra metrics → hot-pod logs |
| 5 | `CONDUCTOR_ERROR_LOGS_COUNT_EXCEEDED_THRESHOLD` | MAJOR | server logs → name dominant exception |
| 6 | `CONDUCTOR_WARN_LOGS_COUNT_EXCEEDED_THRESHOLD` | MINOR | server logs → name dominant warning |
| 7 | `CONDUCTOR_HEALTHY` (failed) | CRITICAL | conductor pod events + logs (crashloop/OOM/image) |
| 8 | `WORKERS_HEALTHY` (failed) | CRITICAL | worker pod events + logs |
| 9 | `PROMETHEUS_NOT_RUNNING` | MAJOR | prometheus pod events + logs (note: metrics may be stale) |
| 10 | `POD_NOT_RUNNING` | MAJOR | pod events (schedule/image/OOM) + logs |
| 11 | `POD_RESTARTED` | MAJOR | pod events for reason + pre-crash logs |
| 12 | `DNS_HEALTHY` (failed) | CRITICAL | `get_ingress_info`: no address ⇒ LB unprovisioned; else external → infra |
| 13 | `DOMAIN_RESOLUTION` | CRITICAL | `get_ingress_info` → resolve vs escalate to infra |
| 14 | `DOMAIN_REACHABILITY` | CRITICAL | `get_ingress_info` → reachable vs escalate to infra |
| 15 | `AUTH_STALE` | MAJOR | self-describing → relay + rotate cluster API key (remediation) |
| 16 | `DOMAIN_CERTIFICATE_WILL_EXPIRE` | MINOR | self-describing (domain + days in msg) → renew cert |
| 17 | `RESPONSE_TIME` | MINOR | optionally correlate CPU/heap/restarts, else relay latency |

On testing the playbook: the playbook is prompt text — by CLAUDE.md rule 1 it can't be
LLM-judged in unit tests and isn't deterministically assertable, so it has no unit test by design.
What is tested deterministically: the read-only safety guard and the per-tool dispatch contract
(get_ingress_info → GET_INGRESS_INFO, mutating commands stay unreachable). Playbook quality is
validated in the eval step above.
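
The per-tool dispatch contract can be illustrated with a fake dispatcher, roughly in the spirit of test_tools_dispatch.py. The command names below appear in this PR; the `FakeDispatcher` class, the `dispatch` signature, and the `make_get_ingress_info` factory are hypothetical stand-ins for whatever tools.py actually exposes.

```python
from typing import Any, Callable, Dict, List, Tuple

# Commands named in this PR as mutating, privileged, or disruptive.
DENIED_COMMANDS = frozenset(
    {"DELETE_POD", "ROLLOUT_RESTART", "KUBECTL_UNRESTRICTED", "DOWNLOAD_HEAP_DUMP"}
)


class FakeDispatcher:
    """Records dispatched commands instead of starting Conductor workflows."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def dispatch(self, command: str, execution_id: str) -> Dict[str, Any]:
        self.calls.append((command, execution_id))
        return {"status": "COMPLETED", "output": {}}


def make_get_ingress_info(dispatcher: FakeDispatcher) -> Callable[[str], Dict[str, Any]]:
    """Hypothetical tool factory: the tool only threads the execution id."""

    def get_ingress_info(execution_id: str) -> Dict[str, Any]:
        return dispatcher.dispatch("GET_INGRESS_INFO", execution_id)

    return get_ingress_info


def dispatched_commands(dispatcher: FakeDispatcher) -> List[str]:
    """Return the command names in the order they were dispatched."""
    return [command for command, _ in dispatcher.calls]
```

A test then calls the tool once and asserts both the expected command and that nothing on the deny list was dispatched, with no LLM or network in the assertion path.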

Testing

cd oncall-agent
PYTHONPATH=src python -m pytest -q   # 35 passing

Out of scope (v1)

Triage + read-only investigation only. No remediation (restart/scale/rollback).

Scaffolds an Agentspan agent that triages Orkes SaaS health-check alerts. It listens on the Slack alert channel, reads the failing health_check execution by id, and runs READ-ONLY agent-handler commands against the ah5r-prod Conductor API to investigate, then replies in-thread with a root-cause hypothesis. Advisory/dry-run only: no remediating actions.

- sql_guard: deterministic SELECT-only guard (not LLM-trusted) for the run_sql_select tool; rejects DML/DDL, multi-statement, comment-smuggling.
- conductor_client: dispatch read-only agent-handler workflows + poll; reads org/cluster/cloudEnvironmentTag off the failing execution so only the executionId is threaded into tools.
- tools: 9 read-only investigation tools mapped to agent-handler commands.
- agent: Claude triage loop with a component->investigation runbook.
- slack_app: Socket Mode listener -> triage -> threaded reply.
- tests: sql_guard + alert parsing, deterministic (no LLM), validated by proving each fails before passing per CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ention

Follows the pattern in sdk/python/examples/91_slack_autofix_agent.py (per PR #135): poll the alert channel with conversations.history + reply via chat.postMessage using a bot token only (no slack_bolt / Socket Mode, no app-level token). Run-once or --loop, dedup via a local state file. Slack I/O lives in a deterministic poller; the triage agent stays pure (investigates a given execution id).

Adds test_poller.py covering alert-only triage, cross-poll dedup, and failure reporting (fakes, no network/LLM; validated fail-then-pass per CLAUDE.md).

Config: drop SLACK_APP_TOKEN; add SLACK_ALERT_CHANNEL (required), ONCALL_POLL_INTERVAL, ONCALL_STATE_FILE. requirements: drop slack-bolt, add requests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
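
The state-file dedup described in this commit could be sketched as below. This is an assumption-laden illustration rather than the poller in this PR: it keys dedup on the Slack message `ts` field, stores seen values as a JSON list, and takes already-fetched messages plus a `handle` callback so it runs without network access. The names `poll_once`, `load_seen`, and `save_seen` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Set

Message = Dict[str, Any]


def load_seen(state_file: Path) -> Set[str]:
    """Read the set of already-handled message timestamps (empty if no file)."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def save_seen(state_file: Path, seen: Set[str]) -> None:
    """Persist handled timestamps as a sorted JSON list."""
    state_file.write_text(json.dumps(sorted(seen)))


def poll_once(
    messages: Iterable[Message],
    state_file: Path,
    handle: Callable[[Message], None],
) -> List[str]:
    """Handle each unseen message once; return the timestamps handled now."""
    seen = load_seen(state_file)
    handled: List[str] = []
    try:
        for message in messages:
            ts = message.get("ts")
            if ts is None or ts in seen:
                continue
            handle(message)  # a message is marked seen only after this succeeds
            seen.add(ts)
            handled.append(ts)
    finally:
        # Save progress even if a later message fails, so earlier ones
        # are not re-triaged on the next poll.
        save_seen(state_file, seen)
    return handled
```

Keeping Slack I/O outside this function is what makes cross-poll dedup testable with fakes.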

- test_tools_readonly: source-level invariant that no mutating/privileged agent-handler command is wired into tools.py, and SQL goes through the SELECT guard. Validated by adding DELETE_POD and confirming failure.
- scripts/smoke_dispatch.py: LLM-free live check against ah5r-prod. Reads the failing execution's cluster context, dispatches read-only commands (get_pods_data, get_cluster_metrics, SELECT 1), asserts COMPLETED + output shape. The L1 verification step; run with the Conductor app key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…r domain

Two bugs found during the first live run against ah5r-prod (viz-stage):

1. cloudEnvironmentTag is NOT in the health_check workflow input (it's produced by prepare_agent_handler's output), but sql_conductor reads it from workflow.input. Derive it as c<orgId[:5]>-<clusterName> when absent.
2. Dispatched commands set no task_to_domain, so customer-cluster tasks (e.g. collect_metrics) sat in the default queue and the in-cluster agent never polled them -> TIMED_OUT. Mirror the control plane: wildcard "*" -> the cluster domain (orgId#-#clusterName), with orchestration tasks pinned to NO_DOMAIN. Switched dispatch to StartWorkflowRequest to carry task_to_domain.

Verified live: GET_PODS_DATA and PULL_LOGS now COMPLETE end-to-end with real viz-stage data. Adds deterministic regression tests (fake client) for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
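
The two fixes in this commit can be sketched as pure functions. The tag format `c<orgId[:5]>-<clusterName>` and the `"*"` / `NO_DOMAIN` mapping come from the commit message; the function names are hypothetical, the `#-#` separator is read literally from the message, and the orchestration task names passed in are placeholders, not the real task list.

```python
from typing import Dict, Iterable, Optional


def derive_cloud_environment_tag(
    org_id: str, cluster_name: str, existing: Optional[str] = None
) -> str:
    """Use the tag from the workflow input when present, else derive it."""
    if existing:
        return existing
    return f"c{org_id[:5]}-{cluster_name}"


def build_task_to_domain(
    org_id: str, cluster_name: str, orchestration_tasks: Iterable[str]
) -> Dict[str, str]:
    """Route all tasks to the cluster domain except orchestration tasks."""
    # Assumption: the cluster domain is orgId and clusterName joined by "#-#".
    mapping: Dict[str, str] = {"*": f"{org_id}#-#{cluster_name}"}
    for task_name in orchestration_tasks:
        mapping[task_name] = "NO_DOMAIN"
    return mapping
```

The resulting dictionary is what would be attached to the workflow start request so that the in-cluster agent, which polls only its own domain, picks up the customer-cluster tasks.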

Validated end-to-end against ah5r-prod (viz-stage Redis-critical alert): the agent now autonomously reads the health-check data and pulls server + worker logs to reach a root-cause hypothesis.

Fixes found during the live run:

- runtime_compat: on macOS, run conductor tool-workers as THREADS, not forked processes. Forked children segfault in getaddrinfo (Network.framework is not fork-safe) and 'spawn' can't pickle the worker's thread lock. agentspan ships a thread shim but gates it to Windows; reuse it on macOS. No-op on Linux (where fork is safe, which is how this runs in prod). Wired into triage + slack paths.
- get_incident_details: surface parse_conductor_cluster_data (redis.usage, decider_queue_size = running workflows, indexer_queue_size, heap, cpu, postgres) so the agent reads the queue numbers from the health-check JSON instead of deriving them via SQL.
- runbook: treat queue/usage as the symptom; find the cause in CONDUCTOR SERVER and WORKER pod logs (+ pod events). Explicitly forbid ad-hoc SQL on large tables like `workflow` (decider_queue_size already is the running-workflow count). run_sql_select is a last resort, not the primary tool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…check alerts

- agent.py: add a compact alert-type playbook covering all 17 HealthIssue types (resource saturation, component-down, pod, networking, self-describing). Preserves the existing Redis/CPU/heap guidance; routes the model symptom -> evidence -> cite.
- tools.py: add read-only get_ingress_info (GET_INGRESS_INFO) for DNS / domain resolution / reachability alerts (empty ingress address = LB not provisioned).
- test_tools_dispatch.py: assert each tool dispatches its expected agent-handler command (incl. get_ingress_info) via a fake dispatcher. Validated fail-then-pass.
- test_tools_readonly.py: fix command names to match the AgentHandlerCommand enum (ROLLOUT_RESTART, KUBECTL_UNRESTRICTED, ...) so the guard actually catches them; add heavy/disruptive commands (DOWNLOAD_HEAP_DUMP, etc.) to the deny list.

…verage

Verified against the source workflow/worker (not assumed):

- get_incident_details ref "issues" matches health_check.json taskReferenceName, and the issues task output carries the per-issue severity+description text the playbook matches on (HealthCheckIssuesWorker), so the 17-type playbook is actually reachable.
- alert.py: the real Slack text is markdown (*`[CRITICAL]`* … _Org's_ env *`cluster`*) + emoji + appended execution URL. The old _CLUSTER_RE choked on the italic/bold markers and returned org/cluster=None. Strip *,` decoration and tolerate italic _ so org/cluster parse; execution id + severity were already fine. Added a test built from the exact worker+notify format; validated fail-then-pass.
- test_playbook_coverage.py: deterministic guard that all 17 HealthIssue types stay in the playbook; validated it fails when a type is dropped.
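
The decoration-stripping fix in alert.py might be approximated as below. This is a guess at the technique, not the actual regexes: the commit only shows the alert shape in elided form, so the sample wording, the `<Org>'s env <cluster>` pattern, and all names here are assumptions. Only the severity labels (CRITICAL, MAJOR, MINOR) are taken from the alert table in this PR.

```python
import re
from typing import Dict, Optional

_DECORATION_RE = re.compile(r"[*`]")  # Slack bold and inline-code markers
_SEVERITY_RE = re.compile(r"\[(CRITICAL|MAJOR|MINOR)\]")
# Assumed shape after stripping: "<Org>'s env <cluster>", with optional
# italic underscores still wrapped around "<Org>'s".
_CLUSTER_RE = re.compile(r"_?(?P<org>[^\s_]+)'s_?\s+env\s+(?P<cluster>\S+)")


def strip_slack_decoration(text: str) -> str:
    """Remove * and ` markers so the regexes see plain text."""
    return _DECORATION_RE.sub("", text)


def parse_alert(text: str) -> Dict[str, Optional[str]]:
    """Extract severity, org, and cluster; missing fields come back as None."""
    plain = strip_slack_decoration(text)
    severity = _SEVERITY_RE.search(plain)
    cluster = _CLUSTER_RE.search(plain)
    return {
        "severity": severity.group(1) if severity else None,
        "org": cluster.group("org") if cluster else None,
        "cluster": cluster.group("cluster") if cluster else None,
    }
```

Stripping decoration before matching keeps the patterns themselves simple, and the optional underscores let the same pattern accept both italic and plain org names.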

…ai.agents

The main merge into this branch (cd689ae) renamed the Python SDK package from `agentspan` to `conductor` (dist `conductor-agent-sdk`), import root `conductor.ai.agents`. oncall-agent still imported `agentspan.agents`, so the module could not import its SDK at all on the post-merge branch. Swap the five SDK imports (Agent, tool, AgentRuntime, worker_manager shim) to the new namespace.

Left the local `agentspan_server_url` config field + AGENTSPAN_SERVER_URL env var as-is: they're our own naming, not SDK symbols, and the env var is documented. Full suite passes (37) against the new namespace.

…spatcher

Runs the actual conductor.ai.agents AgentRuntime + agent reasoning + tool-calling against the local server, with the Conductor dispatcher replaced by canned fixtures so it never touches ah5r-prod. Validation is deterministic (CLAUDE.md): asserts every dispatched command is read-only and the runtime returned output; it does NOT LLM-judge the hypothesis.

Verified live: agent reads the incident first, then dispatches only read-only commands (GET_CLUSTER_METRICS, GET_PODS_DATA, GET_POD_EVENTS, PULL_LOGS x2), follows the Redis->decider-queue->logs playbook, and emits the Issue/Findings/Root-cause/Next-step summary. Proves the SDK migration runs end-to-end and the read-only guard holds at runtime.

manan164 and others added 10 commits June 15, 2026 19:32

Merge branch 'main' into feat/oncall-triage-agent

cd689ae

manan164 wants to merge 10 commits into main from feat/oncall-triage-agent