feat(oncall-agent): read-only Slack-triggered cluster triage agent #293
Draft
manan164 wants to merge 10 commits into
Scaffolds an Agentspan agent that triages Orkes SaaS health-check alerts. It listens on the Slack alert channel, reads the failing health_check execution by id, and runs READ-ONLY agent-handler commands against the ah5r-prod Conductor API to investigate, then replies in-thread with a root-cause hypothesis. Advisory/dry-run only: no remediating actions.

- sql_guard: deterministic SELECT-only guard (not LLM-trusted) for the run_sql_select tool; rejects DML/DDL, multi-statement, comment-smuggling.
- conductor_client: dispatch read-only agent-handler workflows + poll; reads org/cluster/cloudEnvironmentTag off the failing execution so only the executionId is threaded into tools.
- tools: 9 read-only investigation tools mapped to agent-handler commands.
- agent: Claude triage loop with a component->investigation runbook.
- slack_app: Socket Mode listener -> triage -> threaded reply.
- tests: sql_guard + alert parsing, deterministic (no LLM), validated by proving each fails before passing per CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ention

Follows the pattern in sdk/python/examples/91_slack_autofix_agent.py (per PR #135): poll the alert channel with conversations.history + reply via chat.postMessage using a bot token only (no slack_bolt / Socket Mode, no app-level token). Run-once or --loop, dedup via a local state file. Slack I/O lives in a deterministic poller; the triage agent stays pure (investigates a given execution id).

Adds test_poller.py covering alert-only triage, cross-poll dedup, and failure reporting (fakes, no network/LLM; validated fail-then-pass per CLAUDE.md).

Config: drop SLACK_APP_TOKEN; add SLACK_ALERT_CHANNEL (required), ONCALL_POLL_INTERVAL, ONCALL_STATE_FILE. requirements: drop slack-bolt, add requests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
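The run-once poll with state-file dedup described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `poller` code: `load_seen`, `save_seen`, and `poll_once` are hypothetical names, the state file is assumed to be a JSON list of message timestamps, and the Slack call is injected as a plain callable so the sketch runs without a network or token. The only Slack detail relied on is that each message returned by `conversations.history` carries a `ts` field.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_seen(state_file: Path) -> set:
    """Return the set of message timestamps already handled (empty if no file yet)."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def save_seen(state_file: Path, seen: set) -> None:
    """Persist handled timestamps so a later poll (or process restart) skips them."""
    state_file.write_text(json.dumps(sorted(seen)))


def poll_once(
    fetch_messages: Callable[[], Iterable[dict]],
    handle: Callable[[dict], None],
    state_file: Path,
) -> int:
    """Handle each unseen message exactly once; return how many were handled."""
    seen = load_seen(state_file)
    handled = 0
    try:
        for message in fetch_messages():
            ts = message["ts"]  # Slack's per-channel message timestamp
            if ts in seen:
                continue
            handle(message)
            seen.add(ts)
            handled += 1
    finally:
        # Save even if a handler raises, so completed work is not repeated.
        save_seen(state_file, seen)
    return handled
```

In a real poller, `fetch_messages` would wrap the `conversations.history` request and `handle` would run triage and post the threaded reply; a `--loop` mode would call `poll_once` repeatedly with a sleep between calls.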
- test_tools_readonly: source-level invariant that no mutating/privileged agent-handler command is wired into tools.py, and SQL goes through the SELECT guard. Validated by adding DELETE_POD and confirming failure.
- scripts/smoke_dispatch.py: LLM-free live check against ah5r-prod. Reads the failing execution's cluster context, dispatches read-only commands (get_pods_data, get_cluster_metrics, SELECT 1), asserts COMPLETED + output shape. The L1 verification step; run with the Conductor app key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r domain

Two bugs found during the first live run against ah5r-prod (viz-stage):

1. cloudEnvironmentTag is NOT in the health_check workflow input (it's produced by prepare_agent_handler's output), but sql_conductor reads it from workflow.input. Derive it as c<orgId[:5]>-<clusterName> when absent.
2. Dispatched commands set no task_to_domain, so customer-cluster tasks (e.g. collect_metrics) sat in the default queue and the in-cluster agent never polled them -> TIMED_OUT. Mirror the control plane: wildcard "*" -> the cluster domain (orgId#-#clusterName), with orchestration tasks pinned to NO_DOMAIN. Switched dispatch to StartWorkflowRequest to carry task_to_domain.

Verified live: GET_PODS_DATA and PULL_LOGS now COMPLETE end-to-end with real viz-stage data. Adds deterministic regression tests (fake client) for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
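The two derivations in this commit can be written out as small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, and the orchestration task names passed in are placeholders, since the commit message does not list which tasks are pinned to `NO_DOMAIN`. Only the string formats (`c<orgId[:5]>-<clusterName>` and `orgId#-#clusterName`) and the `"*"` wildcard come from the text above.

```python
from typing import Dict, Iterable


def derive_cloud_environment_tag(org_id: str, cluster_name: str) -> str:
    """Fallback tag when the workflow input lacks cloudEnvironmentTag."""
    return f"c{org_id[:5]}-{cluster_name}"


def cluster_domain(org_id: str, cluster_name: str) -> str:
    """Task domain polled by the in-cluster agent for this org/cluster."""
    return f"{org_id}#-#{cluster_name}"


def build_task_to_domain(
    org_id: str,
    cluster_name: str,
    orchestration_tasks: Iterable[str],  # placeholder names, supplied by the caller
) -> Dict[str, str]:
    """Route every task to the cluster domain except the pinned orchestration tasks."""
    mapping = {"*": cluster_domain(org_id, cluster_name)}
    for task_name in orchestration_tasks:
        mapping[task_name] = "NO_DOMAIN"
    return mapping
```

The resulting dictionary is the kind of value the commit says is carried on the start request as `task_to_domain`; how it is attached to `StartWorkflowRequest` is left to the actual client code.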
Validated end-to-end against ah5r-prod (viz-stage Redis-critical alert): the agent now autonomously reads the health-check data and pulls server + worker logs to reach a root-cause hypothesis. Fixes found during the live run:

- runtime_compat: on macOS, run conductor tool-workers as THREADS, not forked processes. Forked children segfault in getaddrinfo (Network.framework is not fork-safe) and 'spawn' can't pickle the worker's thread lock. agentspan ships a thread shim but gates it to Windows; reuse it on macOS. No-op on Linux (where fork is safe, which is how this runs in prod). Wired into triage + slack paths.
- get_incident_details: surface parse_conductor_cluster_data (redis.usage, decider_queue_size = running workflows, indexer_queue_size, heap, cpu, postgres) so the agent reads the queue numbers from the health-check JSON instead of deriving them via SQL.
- runbook: treat queue/usage as the symptom; find the cause in CONDUCTOR SERVER and WORKER pod logs (+ pod events). Explicitly forbid ad-hoc SQL on large tables like `workflow` (decider_queue_size already is the running-workflow count). run_sql_select is a last resort, not the primary tool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…check alerts

- agent.py: add a compact alert-type playbook covering all 17 HealthIssue types (resource saturation, component-down, pod, networking, self-describing). Preserves the existing Redis/CPU/heap guidance; routes the model symptom -> evidence -> cite.
- tools.py: add read-only get_ingress_info (GET_INGRESS_INFO) for DNS / domain resolution / reachability alerts (empty ingress address = LB not provisioned).
- test_tools_dispatch.py: assert each tool dispatches its expected agent-handler command (incl. get_ingress_info) via a fake dispatcher. Validated fail-then-pass.
- test_tools_readonly.py: fix command names to match the AgentHandlerCommand enum (ROLLOUT_RESTART, KUBECTL_UNRESTRICTED, ...) so the guard actually catches them; add heavy/disruptive commands (DOWNLOAD_HEAP_DUMP, etc.) to the deny list.
…verage

Verified against the source workflow/worker (not assumed):

- get_incident_details ref "issues" matches health_check.json taskReferenceName, and the issues task output carries the per-issue severity+description text the playbook matches on (HealthCheckIssuesWorker), so the 17-type playbook is actually reachable.
- alert.py: the real Slack text is markdown (*`[CRITICAL]`* … _Org's_ env *`cluster`*) + emoji + appended execution URL. The old _CLUSTER_RE choked on the italic/bold markers and returned org/cluster=None. Strip *,` decoration and tolerate italic _ so org/cluster parse; execution id + severity were already fine. Added a test built from the exact worker+notify format; validated fail-then-pass.
- test_playbook_coverage.py: deterministic guard that all 17 HealthIssue types stay in the playbook; validated it fails when a type is dropped.
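The alert-parsing fix can be illustrated with a small sketch: strip the `*` and backtick decoration first, then match with patterns that tolerate the italic `_` around the org name. Everything here is an assumption for illustration, not the PR's `alert.py`: the `Alert` tuple, the three regexes, the `/execution/<uuid>` URL shape, and the restriction of org names to letters, digits, dots, and hyphens.

```python
import re
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    severity: Optional[str]
    org: Optional[str]
    cluster: Optional[str]
    execution_id: Optional[str]


# Hypothetical patterns; the real alert text and URL layout may differ.
_SEVERITY_RE = re.compile(r"\[(?P<severity>[A-Z]+)\]")
_CLUSTER_RE = re.compile(r"_?(?P<org>[A-Za-z0-9.-]+)'s_?\s+env\s+(?P<cluster>[\w.-]+)")
_EXECUTION_RE = re.compile(r"/execution/(?P<id>[0-9a-fA-F-]{36})")


def parse_alert(text: str) -> Alert:
    """Extract severity, org, cluster, and execution id from a Slack alert message."""
    # Remove bold and inline-code markers; underscores stay because cluster
    # names may legitimately contain them.
    plain = text.replace("*", "").replace("`", "")
    severity = _SEVERITY_RE.search(plain)
    cluster = _CLUSTER_RE.search(plain)
    execution = _EXECUTION_RE.search(plain)
    return Alert(
        severity=severity.group("severity") if severity else None,
        org=cluster.group("org") if cluster else None,
        cluster=cluster.group("cluster") if cluster else None,
        execution_id=execution.group("id") if execution else None,
    )
```

Returning `None` for a field that fails to parse, rather than raising, lets a poller skip non-alert messages while still triaging any message that carries an execution id.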
…ai.agents

The main merge into this branch (cd689ae) renamed the Python SDK package from `agentspan` to `conductor` (dist `conductor-agent-sdk`), import root `conductor.ai.agents`. oncall-agent still imported `agentspan.agents`, so the module could not import its SDK at all on the post-merge branch. Swap the five SDK imports (Agent, tool, AgentRuntime, worker_manager shim) to the new namespace. Left the local `agentspan_server_url` config field + AGENTSPAN_SERVER_URL env var as-is: they're our own naming, not SDK symbols, and the env var is documented. Full suite passes (37) against the new namespace.
…spatcher

Runs the actual conductor.ai.agents AgentRuntime + agent reasoning + tool-calling against the local server, with the Conductor dispatcher replaced by canned fixtures so it never touches ah5r-prod. Validation is deterministic (CLAUDE.md): asserts every dispatched command is read-only and the runtime returned output; does NOT LLM-judge the hypothesis.

Verified live: agent reads the incident first, then dispatches only read-only commands (GET_CLUSTER_METRICS, GET_PODS_DATA, GET_POD_EVENTS, PULL_LOGS x2), follows the Redis->decider-queue->logs playbook, and emits the Issue/Findings/Root-cause/Next-step summary. Proves the SDK migration runs end-to-end and the read-only guard holds at runtime.
feat(oncall-agent): read-only Slack-triggered cluster triage agent
What this is
An Agentspan agent that triages Orkes SaaS cluster health-check alerts. It polls the Slack alert channel (Web API, no Socket Mode), parses the `executionId` from the failing `health_check` execution URL, runs read-only agent-handler commands against the `ah5r-prod` Conductor API to investigate, and replies in-thread with a root-cause hypothesis.

It is strictly read-only and advisory: it never takes a remediating action. SQL is gated by a deterministic SELECT-only guard (`sql_guard.py`), not by trusting the model. The agent reads `organizationId` / `clusterName` / `cloudEnvironmentTag` off the failing execution, so the LLM only threads the `executionId` into each tool: it cannot target the wrong cluster, and no secrets pass through tool args.
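A deterministic SELECT-only guard of the kind described above might look like the following. This is a deliberately conservative sketch, not the contents of `sql_guard.py`: `is_read_only`, the keyword list, and the blanket rejection of comments are all assumptions. It errs on the side of rejecting, so a legitimate query that merely contains a word like `update` inside a string literal is refused too.

```python
import re

# Statements must start with one of these (per the PR: SELECT/WITH/EXPLAIN/SHOW).
_ALLOWED_FIRST_WORDS = ("select", "with", "explain", "show")

# Hypothetical deny list of mutating keywords, matched anywhere in the statement.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|merge|call|copy|into)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Return True only for a single, comment-free, read-only statement."""
    text = sql.strip()
    # Reject comments outright so nothing can be smuggled inside them.
    if "--" in text or "/*" in text or "*/" in text:
        return False
    # Allow one trailing semicolon, but no statement separators elsewhere.
    if text.endswith(";"):
        text = text[:-1].rstrip()
    if not text or ";" in text:
        return False
    first_word = text.split(None, 1)[0].lower()
    if first_word not in _ALLOWED_FIRST_WORDS:
        return False
    # Catch mutations hidden behind an allowed prefix, e.g. WITH ... DELETE.
    return _FORBIDDEN.search(text) is None
```

Because the check is plain string matching with no model in the loop, it can be unit-tested exhaustively, which is the property the description relies on.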
Done ✅
- Slack polling (`conversations.history` + `chat.postMessage`), bot token only, run-once / `--loop`, state-file dedup. Matches `sdk/python/examples/91_slack_autofix_agent.py`.
- `alert.py`: extracts execution id + severity + org/cluster from the alert text.
- `conductor_client.py`: starts agent-handler command workflows on ah5r-prod via conductor-python (app key/secret), polls to completion, derives + caches cluster context.
- `tools.py`: `get_incident_details`, `get_cluster_metrics`, `get_infrastructure_metrics`, `get_pods_data`, `get_deployments_info`, `get_pod_events`, `get_top_output`, `pull_pod_logs`, `get_ingress_info`, `run_sql_select` (SELECT-guarded).
- `sql_guard.py`: SELECT/WITH/EXPLAIN/SHOW only; rejects every mutation, multi-statement, and comment-smuggling case before it reaches the DB.
- `agent.py`: symptom → evidence-to-gather → what-to-cite, matched off the `issues` text. Keeps the strong Redis / decider-queue / CPU / heap guidance and extends to component-down, pod, networking, and self-describing alerts (table below).
- Tests (per `CLAUDE.md`): `test_sql_guard.py`, `test_alert.py`, `test_poller.py`, `test_conductor_client.py`, `test_tools_readonly.py` (read-only safety guard, pinned to the real `AgentHandlerCommand` enum names), `test_tools_dispatch.py` (each tool dispatches its expected agent-handler command). 35 passing.
- `python -m oncall_agent.main [triage <execId>]`, `.env.example`, README.

To do 🚧
- … Agentspan server running with `ANTHROPIC_API_KEY`. Not yet exercised live.
- … before trusting it. Tune the playbook from real misses.
- HITL approval gate.
The 17 health-check alert types — all now have a playbook
Source of truth: `HealthIssue` enum in `orkes-saas/.../worker/HealthCheckIssuesWorker.java`. "Approach" = how `agent.py` triages it.

| Alert type | Approach |
| --- | --- |
| `REDIS_CRITICAL_USAGE` | … |
| `REDIS_HIGH_USAGE` | … |
| `CONDUCTOR_HIGH_HEAP_USAGE` | … |
| `CONDUCTOR_HIGH_CPU_USAGE` | … |
| `CONDUCTOR_ERROR_LOGS_COUNT_EXCEEDED_THRESHOLD` | … |
| `CONDUCTOR_WARN_LOGS_COUNT_EXCEEDED_THRESHOLD` | … |
| `CONDUCTOR_HEALTHY` (failed) | … |
| `WORKERS_HEALTHY` (failed) | … |
| `PROMETHEUS_NOT_RUNNING` | … |
| `POD_NOT_RUNNING` | … |
| `POD_RESTARTED` | … |
| `DNS_HEALTHY` (failed) | `get_ingress_info`: no address ⇒ LB unprovisioned; else external → infra |
| `DOMAIN_RESOLUTION` | `get_ingress_info` → resolve vs escalate to infra |
| `DOMAIN_REACHABILITY` | `get_ingress_info` → reachable vs escalate to infra |
| `AUTH_STALE` | … |
| `DOMAIN_CERTIFICATE_WILL_EXPIRE` | … |
| `RESPONSE_TIME` | … |

On testing the playbook: the playbook is prompt text. By `CLAUDE.md` rule 1 it can't be LLM-judged in unit tests and isn't deterministically assertable, so it has no unit test by design. What is tested deterministically: the read-only safety guard and the per-tool dispatch contract (`get_ingress_info` → `GET_INGRESS_INFO`, mutating commands stay unreachable). Playbook quality is validated in the eval step above.
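The two deterministic checks named above can be sketched without any LLM or network. This is an illustration only: `FakeDispatcher`, `find_denied_commands`, and the one-line `get_ingress_info` wrapper are hypothetical stand-ins for the real `tools.py` and its tests, and the deny list contains only the command names mentioned elsewhere in this PR.

```python
import re
from typing import List, Set, Tuple

# Command names taken from the PR text; the real deny list is likely longer.
DENIED_COMMANDS = {
    "DELETE_POD",
    "ROLLOUT_RESTART",
    "KUBECTL_UNRESTRICTED",
    "DOWNLOAD_HEAP_DUMP",
}


def find_denied_commands(source: str, denied: Set[str] = DENIED_COMMANDS) -> Set[str]:
    """Source-level invariant: return any denied command name that appears in the code."""
    return {
        name for name in denied
        if re.search(rf"\b{re.escape(name)}\b", source)
    }


class FakeDispatcher:
    """Records dispatched commands instead of starting Conductor workflows."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def dispatch(self, execution_id: str, command: str) -> dict:
        self.calls.append((execution_id, command))
        return {"status": "COMPLETED"}


def get_ingress_info(dispatcher: FakeDispatcher, execution_id: str) -> dict:
    """Hypothetical tool wrapper: only the execution id is passed in by the caller."""
    return dispatcher.dispatch(execution_id, "GET_INGRESS_INFO")
```

A test in this style would read the tool module's source text, assert `find_denied_commands(source) == set()`, and then call each tool against the fake to confirm it dispatches exactly its expected command.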
Testing
Out of scope (v1)
Triage + read-only investigation only. No remediation (restart/scale/rollback).