feat(oncall-agent): read-only Slack-triggered cluster triage agent#293

Draft
manan164 wants to merge 10 commits into main from feat/oncall-triage-agent


Conversation

@manan164 manan164 commented Jun 29, 2026

Contributor

feat(oncall-agent): read-only Slack-triggered cluster triage agent

Draft. The scaffold, the read-only investigation loop, and a per-alert-type triage
playbook for all 17 health-check alert types now work. What remains is a live
end-to-end run against ah5r-prod and a hypothesis-quality eval.

What this is

An Agentspan agent that triages Orkes SaaS cluster health-check alerts. It polls the Slack
alert channel (Web API, no Socket Mode), parses the executionId from the failing
health_check execution URL, runs read-only agent-handler commands against the ah5r-prod
Conductor API to investigate, and replies in-thread with a root-cause hypothesis.

It is strictly read-only and advisory — it never takes a remediating action. SQL is gated by
a deterministic SELECT-only guard (sql_guard.py), not by trusting the model. The agent reads
organizationId / clusterName / cloudEnvironmentTag off the failing execution, so the LLM
only threads the executionId into each tool — it cannot target the wrong cluster and no secrets
pass through tool args.
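
To illustrate what a deterministic guard of this kind could look like, here is a minimal sketch. It is not the contents of sql_guard.py: the function name `is_read_only_sql`, the keyword list, and the choice to reject any comment are assumptions made for the example, and the check is deliberately conservative (it may refuse some harmless queries).

```python
import re

# Statement types treated as read-only (SELECT/WITH/EXPLAIN/SHOW).
_ALLOWED_LEADING = {"SELECT", "WITH", "EXPLAIN", "SHOW"}

# Keywords that can mutate data or schema even inside an allowed statement,
# e.g. a data-modifying CTE or EXPLAIN ANALYZE DELETE.
# This list is an assumption for the sketch, not the PR's actual list.
_MUTATING = re.compile(
    r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|CREATE|TRUNCATE|GRANT|REVOKE|COPY|CALL|INTO)\b",
    re.IGNORECASE,
)


def is_read_only_sql(query: str) -> bool:
    """Return True only if the query looks like a single read-only statement."""
    text = query.strip()
    # Comments can hide a second statement, so reject them outright.
    if "--" in text or "/*" in text or "*/" in text:
        return False
    # Tolerate one trailing semicolon, then forbid any other.
    if text.endswith(";"):
        text = text[:-1].rstrip()
    if not text or ";" in text:
        return False
    # The first keyword must be one of the allowed statement types.
    if text.split(None, 1)[0].upper() not in _ALLOWED_LEADING:
        return False
    # Conservative: a mutating keyword anywhere (even inside a string
    # literal) rejects the query.
    return _MUTATING.search(text) is None
```

Because the decision is made by plain string checks rather than by the model, the same input always gets the same verdict, which is what makes it assertable in unit tests.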

Done ✅

  • Slack ingestion — Web API poller (conversations.history + chat.postMessage), bot token
    only, run-once / --loop, state-file dedup. Matches sdk/python/examples/91_slack_autofix_agent.py.
  • Alert parsing (alert.py) — extracts execution id + severity + org/cluster from the alert text.
  • Conductor dispatch (conductor_client.py) — starts agent-handler command workflows on
    ah5r-prod via conductor-python (app key/secret), polls to completion, derives + caches cluster context.
  • Read-only tool set (tools.py) — get_incident_details, get_cluster_metrics,
    get_infrastructure_metrics, get_pods_data, get_deployments_info, get_pod_events,
    get_top_output, pull_pod_logs, get_ingress_info, run_sql_select (SELECT-guarded).
  • SQL safety guard (sql_guard.py) — SELECT/WITH/EXPLAIN/SHOW only; rejects every mutation,
    multi-statement, and comment-smuggling case before it reaches the DB.
  • Per-alert-type triage playbook for all 17 health-check alert types (agent.py) — symptom →
    evidence-to-gather → what-to-cite, matched off the issues text. Keeps the strong Redis /
    decider-queue / CPU / heap guidance and extends to component-down, pod, networking, and
    self-describing alerts (table below).
  • Tests (deterministic, no LLM in the assertion path, per CLAUDE.md) — test_sql_guard.py,
    test_alert.py, test_poller.py, test_conductor_client.py, test_tools_readonly.py
    (read-only safety guard, pinned to the real AgentHandlerCommand enum names),
    test_tools_dispatch.py (each tool dispatches its expected agent-handler command). 35 passing.
  • Local run path (python -m oncall_agent.main [triage <execId>]), .env.example, README.
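
The Slack ingestion item above (poll, dedup through a state file, reply in-thread) can be sketched roughly as follows. This is an illustration only, not the PR's poller: `poll_once`, `load_seen`, and `save_seen` are hypothetical names, the state file is assumed to be a JSON list of message timestamps, and the Slack calls are injected as plain callables so the sketch needs no network. In a real poller `messages` would come from conversations.history and `reply` would call chat.postMessage with `thread_ts` set to the alert's `ts`.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_seen(state_file: Path) -> set[str]:
    """Read the set of already-handled message timestamps (empty if no file)."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def save_seen(state_file: Path, seen: set[str]) -> None:
    """Persist handled timestamps so a later poll skips them."""
    state_file.write_text(json.dumps(sorted(seen)))


def poll_once(
    messages: Iterable[dict],
    triage: Callable[[str], str],
    reply: Callable[[str, str], None],
    state_file: Path,
) -> int:
    """Triage each unseen message once and reply in its thread.

    Returns the number of messages handled in this poll.
    """
    seen = load_seen(state_file)
    handled = 0
    for msg in messages:
        ts = msg["ts"]  # Slack's per-message timestamp doubles as its id
        if ts in seen:
            continue  # already triaged in an earlier poll
        seen.add(ts)
        reply(ts, triage(msg["text"]))
        handled += 1
    save_seen(state_file, seen)
    return handled
```

Keeping the Slack I/O behind injected callables is what lets the dedup behaviour be tested with fakes, with no network and no LLM in the assertion path.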

To do 🚧

  • End-to-end run against ah5r-prod — needs Conductor app key/secret, Slack app tokens, and the
    Agentspan server running with ANTHROPIC_API_KEY. Not yet exercised live.
  • Hypothesis-quality eval — run dry-run on real alerts and score the root-cause hypotheses
    before trusting it. Tune the playbook from real misses.
  • Remediation — deliberately out of scope for v1; when added it must go behind the Agentspan
    HITL approval gate.

The 17 health-check alert types — all now have a playbook

Source of truth: HealthIssue enum in
orkes-saas/.../worker/HealthCheckIssuesWorker.java. "Approach" = how agent.py triages it.

| # | Alert type | Sev | Triage approach |
|---|---|---|---|
| 1 | REDIS_CRITICAL_USAGE | CRITICAL | decider-queue backlog → server/worker logs |
| 2 | REDIS_HIGH_USAGE | MAJOR | decider-queue backlog → server/worker logs |
| 3 | CONDUCTOR_HIGH_HEAP_USAGE | MAJOR | top + infra metrics → logs grep OutOfMemory/GC |
| 4 | CONDUCTOR_HIGH_CPU_USAGE | MAJOR | top + infra metrics → hot-pod logs |
| 5 | CONDUCTOR_ERROR_LOGS_COUNT_EXCEEDED_THRESHOLD | MAJOR | server logs → name dominant exception |
| 6 | CONDUCTOR_WARN_LOGS_COUNT_EXCEEDED_THRESHOLD | MINOR | server logs → name dominant warning |
| 7 | CONDUCTOR_HEALTHY (failed) | CRITICAL | conductor pod events + logs (crashloop/OOM/image) |
| 8 | WORKERS_HEALTHY (failed) | CRITICAL | worker pod events + logs |
| 9 | PROMETHEUS_NOT_RUNNING | MAJOR | prometheus pod events + logs (note: metrics may be stale) |
| 10 | POD_NOT_RUNNING | MAJOR | pod events (schedule/image/OOM) + logs |
| 11 | POD_RESTARTED | MAJOR | pod events for reason + pre-crash logs |
| 12 | DNS_HEALTHY (failed) | CRITICAL | get_ingress_info: no address ⇒ LB unprovisioned; else external → infra |
| 13 | DOMAIN_RESOLUTION | CRITICAL | get_ingress_info → resolve vs escalate to infra |
| 14 | DOMAIN_REACHABILITY | CRITICAL | get_ingress_info → reachable vs escalate to infra |
| 15 | AUTH_STALE | MAJOR | self-describing → relay + rotate cluster API key (remediation) |
| 16 | DOMAIN_CERTIFICATE_WILL_EXPIRE | MINOR | self-describing (domain + days in msg) → renew cert |
| 17 | RESPONSE_TIME | MINOR | optionally correlate CPU/heap/restarts, else relay latency |

On testing the playbook: the playbook is prompt text — by CLAUDE.md rule 1 it can't be
LLM-judged in unit tests and isn't deterministically assertable, so it has no unit test by design.
What is tested deterministically: the read-only safety guard and the per-tool dispatch contract
(get_ingress_info → GET_INGRESS_INFO, mutating commands stay unreachable). Playbook quality is
validated in the eval step above.
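
As a rough picture of the dispatch-contract check, the sketch below records which command each tool sends through a fake dispatcher. It is not the code in test_tools_dispatch.py: `FakeDispatcher`, `make_tool`, and `dispatched_commands` are hypothetical names, only a subset of tools is shown, and the tool-to-command pairing (in particular pull_pod_logs to PULL_LOGS) is assumed from the names that appear in this PR.

```python
from typing import Callable, Optional


class FakeDispatcher:
    """Stands in for the Conductor client; records commands, no network."""

    def __init__(self) -> None:
        self.commands: list[str] = []

    def dispatch(self, execution_id: str, command: str,
                 params: Optional[dict] = None) -> dict:
        self.commands.append(command)
        return {"status": "COMPLETED"}


def make_tool(command: str, dispatcher: FakeDispatcher) -> Callable[..., dict]:
    """Build a tool that only ever dispatches its one fixed command."""
    def tool(execution_id: str, **params) -> dict:
        return dispatcher.dispatch(execution_id, command, params)
    return tool


# Assumed pairing of tool name to agent-handler command (illustrative subset).
TOOL_COMMANDS = {
    "get_pods_data": "GET_PODS_DATA",
    "get_cluster_metrics": "GET_CLUSTER_METRICS",
    "get_pod_events": "GET_POD_EVENTS",
    "pull_pod_logs": "PULL_LOGS",
    "get_ingress_info": "GET_INGRESS_INFO",
}

# Mutating or heavy commands named in this PR that must stay unreachable.
DENIED_COMMANDS = {
    "ROLLOUT_RESTART", "KUBECTL_UNRESTRICTED", "DOWNLOAD_HEAP_DUMP", "DELETE_POD",
}


def dispatched_commands() -> list[str]:
    """Call every tool once and return the commands the dispatcher saw."""
    dispatcher = FakeDispatcher()
    for command in TOOL_COMMANDS.values():
        make_tool(command, dispatcher)("exec-123")
    return dispatcher.commands
```

A test can then compare the recorded commands against the expected list and assert that none of them is in the deny set.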

Testing

cd oncall-agent
PYTHONPATH=src python -m pytest -q   # 35 passing

Out of scope (v1)

Triage + read-only investigation only. No remediation (restart/scale/rollback).

manan164 and others added 10 commits June 15, 2026 19:32
Scaffolds an Agentspan agent that triages Orkes SaaS health-check alerts.
It listens on the Slack alert channel, reads the failing health_check
execution by id, and runs READ-ONLY agent-handler commands against the
ah5r-prod Conductor API to investigate, then replies in-thread with a
root-cause hypothesis. Advisory/dry-run only — no remediating actions.

- sql_guard: deterministic SELECT-only guard (not LLM-trusted) for the
  run_sql_select tool; rejects DML/DDL, multi-statement, comment-smuggling.
- conductor_client: dispatch read-only agent-handler workflows + poll;
  reads org/cluster/cloudEnvironmentTag off the failing execution so only
  the executionId is threaded into tools.
- tools: 9 read-only investigation tools mapped to agent-handler commands.
- agent: Claude triage loop with a component->investigation runbook.
- slack_app: Socket Mode listener -> triage -> threaded reply.
- tests: sql_guard + alert parsing, deterministic (no LLM), validated by
  proving each fails before passing per CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ention

Follows the pattern in sdk/python/examples/91_slack_autofix_agent.py
(per PR #135): poll the alert channel with conversations.history + reply
via chat.postMessage using a bot token only — no slack_bolt / Socket Mode,
no app-level token. Run-once or --loop, dedup via a local state file.

Slack I/O lives in a deterministic poller; the triage agent stays pure
(investigates a given execution id). Adds test_poller.py covering
alert-only triage, cross-poll dedup, and failure reporting (fakes, no
network/LLM; validated fail-then-pass per CLAUDE.md).

Config: drop SLACK_APP_TOKEN; add SLACK_ALERT_CHANNEL (required),
ONCALL_POLL_INTERVAL, ONCALL_STATE_FILE. requirements: drop slack-bolt,
add requests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- test_tools_readonly: source-level invariant that no mutating/privileged
  agent-handler command is wired into tools.py, and SQL goes through the
  SELECT guard. Validated by adding DELETE_POD and confirming failure.
- scripts/smoke_dispatch.py: LLM-free live check against ah5r-prod — reads
  the failing execution's cluster context, dispatches read-only commands
  (get_pods_data, get_cluster_metrics, SELECT 1), asserts COMPLETED + output
  shape. The L1 verification step; run with the Conductor app key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r domain

Two bugs found during the first live run against ah5r-prod (viz-stage):

1. cloudEnvironmentTag is NOT in the health_check workflow input (it's
   produced by prepare_agent_handler's output), but sql_conductor reads it
   from workflow.input — derive it as c<orgId[:5]>-<clusterName> when absent.

2. Dispatched commands set no task_to_domain, so customer-cluster tasks (e.g.
   collect_metrics) sat in the default queue and the in-cluster agent never
   polled them -> TIMED_OUT. Mirror the control plane: wildcard "*" -> the
   cluster domain (orgId#-#clusterName), with orchestration tasks pinned to
   NO_DOMAIN. Switched dispatch to StartWorkflowRequest to carry task_to_domain.

Verified live: GET_PODS_DATA and PULL_LOGS now COMPLETE end-to-end with real
viz-stage data. Adds deterministic regression tests (fake client) for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
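
The two fixes described in this commit can be sketched as below. This is an illustration under assumptions, not the code in conductor_client.py: the function names are hypothetical, the input keys are taken from the field names mentioned in the PR description, and the list of orchestration tasks to pin is left as a parameter because the commit message does not name them.

```python
def derive_cloud_environment_tag(workflow_input: dict) -> str:
    """Use the tag if present, else derive c<orgId[:5]>-<clusterName>."""
    tag = workflow_input.get("cloudEnvironmentTag")
    if tag:
        return tag
    org_id = workflow_input["organizationId"]
    cluster_name = workflow_input["clusterName"]
    return f"c{org_id[:5]}-{cluster_name}"


def build_task_to_domain(
    org_id: str, cluster_name: str, orchestration_tasks: list[str]
) -> dict[str, str]:
    """Route every task to the cluster domain, except pinned orchestration tasks."""
    # Wildcard entry: tasks not listed explicitly go to the cluster's domain,
    # so the in-cluster agent polls them instead of the default queue.
    mapping = {"*": f"{org_id}#-#{cluster_name}"}
    # Orchestration tasks are pinned to NO_DOMAIN.
    for task_name in orchestration_tasks:
        mapping[task_name] = "NO_DOMAIN"
    return mapping
```

The resulting dictionary is what would be passed as task_to_domain on the start-workflow request.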
Validated end-to-end against ah5r-prod (viz-stage Redis-critical alert): the
agent now autonomously reads the health-check data and pulls server + worker
logs to reach a root-cause hypothesis.

Fixes found during the live run:
- runtime_compat: on macOS, run conductor tool-workers as THREADS, not forked
  processes. Forked children segfault in getaddrinfo (Network.framework is not
  fork-safe) and 'spawn' can't pickle the worker's thread lock. agentspan ships
  a thread shim but gates it to Windows; reuse it on macOS. No-op on Linux
  (where fork is safe — how this runs in prod). Wired into triage + slack paths.
- get_incident_details: surface parse_conductor_cluster_data (redis.usage,
  decider_queue_size = running workflows, indexer_queue_size, heap, cpu,
  postgres) so the agent reads the queue numbers from the health-check JSON
  instead of deriving them via SQL.
- runbook: treat queue/usage as the symptom; find the cause in CONDUCTOR SERVER
  and WORKER pod logs (+ pod events). Explicitly forbid ad-hoc SQL on large
  tables like `workflow` (decider_queue_size already is the running-workflow
  count). run_sql_select is a last resort, not the primary tool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
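
The platform decision in the runtime_compat item above amounts to something like the following sketch. The function name is hypothetical and the real shim lives in the SDK; only the platform test is shown.

```python
import sys


def use_thread_workers(platform: str = sys.platform) -> bool:
    """Return True where tool-workers should run as threads, not forked processes.

    Windows and macOS use threads; Linux keeps the default forked workers.
    """
    return platform in ("win32", "darwin")
```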
…check alerts

- agent.py: add a compact alert-type playbook covering all 17 HealthIssue types
  (resource saturation, component-down, pod, networking, self-describing). Preserves
  the existing Redis/CPU/heap guidance; routes the model symptom -> evidence -> cite.
- tools.py: add read-only get_ingress_info (GET_INGRESS_INFO) for DNS / domain
  resolution / reachability alerts (empty ingress address = LB not provisioned).
- test_tools_dispatch.py: assert each tool dispatches its expected agent-handler
  command (incl. get_ingress_info) via a fake dispatcher. Validated fail-then-pass.
- test_tools_readonly.py: fix command names to match the AgentHandlerCommand enum
  (ROLLOUT_RESTART, KUBECTL_UNRESTRICTED, ...) so the guard actually catches them;
  add heavy/disruptive commands (DOWNLOAD_HEAP_DUMP, etc.) to the deny list.
…verage

Verified against the source workflow/worker (not assumed):
- get_incident_details ref "issues" matches health_check.json taskReferenceName, and
  the issues task output carries the per-issue severity+description text the playbook
  matches on (HealthCheckIssuesWorker) — the 17-type playbook is actually reachable.
- alert.py: the real Slack text is markdown (*`[CRITICAL]`* … _Org's_ env *`cluster`*)
  + emoji + appended execution URL. The old _CLUSTER_RE choked on the italic/bold markers
  and returned org/cluster=None. Strip *,` decoration and tolerate italic _ so org/cluster
  parse; execution id + severity were already fine. Added a test built from the exact
  worker+notify format; validated fail-then-pass.
- test_playbook_coverage.py: deterministic guard that all 17 HealthIssue types stay in the
  playbook; validated it fails when a type is dropped.
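
A coverage guard of the kind described in the last item could look roughly like this. It is a sketch, not test_playbook_coverage.py: `missing_alert_types` is a hypothetical helper, and the type list is copied from the table in the PR description rather than read from the HealthIssue enum.

```python
import re

# The 17 alert types as listed in the PR description.
HEALTH_ISSUE_TYPES = [
    "REDIS_CRITICAL_USAGE",
    "REDIS_HIGH_USAGE",
    "CONDUCTOR_HIGH_HEAP_USAGE",
    "CONDUCTOR_HIGH_CPU_USAGE",
    "CONDUCTOR_ERROR_LOGS_COUNT_EXCEEDED_THRESHOLD",
    "CONDUCTOR_WARN_LOGS_COUNT_EXCEEDED_THRESHOLD",
    "CONDUCTOR_HEALTHY",
    "WORKERS_HEALTHY",
    "PROMETHEUS_NOT_RUNNING",
    "POD_NOT_RUNNING",
    "POD_RESTARTED",
    "DNS_HEALTHY",
    "DOMAIN_RESOLUTION",
    "DOMAIN_REACHABILITY",
    "AUTH_STALE",
    "DOMAIN_CERTIFICATE_WILL_EXPIRE",
    "RESPONSE_TIME",
]


def missing_alert_types(playbook_text: str) -> list[str]:
    """Return the alert types that the playbook text never mentions."""
    return [
        name
        for name in HEALTH_ISSUE_TYPES
        # Word boundaries avoid matching one type name inside a longer one.
        if not re.search(rf"\b{re.escape(name)}\b", playbook_text)
    ]
```

The check only asserts that each type is named somewhere in the prompt text; it says nothing about whether the guidance for that type is any good, which is left to the eval step.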
…ai.agents

The main merge into this branch (cd689ae) renamed the Python SDK package from
`agentspan` to `conductor` (dist `conductor-agent-sdk`), import root
`conductor.ai.agents`. oncall-agent still imported `agentspan.agents`, so the module
could not import its SDK at all on the post-merge branch. Swap the five SDK imports
(Agent, tool, AgentRuntime, worker_manager shim) to the new namespace.

Left the local `agentspan_server_url` config field + AGENTSPAN_SERVER_URL env var
as-is — they're our own naming, not SDK symbols, and the env var is documented.

Full suite passes (37) against the new namespace.
…spatcher

Runs the actual conductor.ai.agents AgentRuntime + agent reasoning + tool-calling against
the local server, with the Conductor dispatcher replaced by canned fixtures so it never
touches ah5r-prod. Validation is deterministic (CLAUDE.md): asserts every dispatched
command is read-only and the runtime returned output — does NOT LLM-judge the hypothesis.

Verified live: agent reads the incident first, then dispatches only read-only commands
(GET_CLUSTER_METRICS, GET_PODS_DATA, GET_POD_EVENTS, PULL_LOGS x2), follows the
Redis->decider-queue->logs playbook, and emits the Issue/Findings/Root-cause/Next-step
summary. Proves the SDK migration runs end-to-end and the read-only guard holds at runtime.