Feature/fix webhook execution token#284

Open: NicholasDCole wants to merge 19 commits into main from feature/fix_webhook_execution_token

@NicholasDCole

Contributor

No description provided.

bradyyie and others added 19 commits June 9, 2026 17:14
For embedding conductor-agentspan into a host (orkes-conductor); default off,
so standalone behavior is unchanged.

- gate HTTP/HUMAN/JOIN/MCP task overrides behind agentspan.embedded so the host's
  native tasks win when embedded
- defer provider validation to the host when embedded (host owns integrations/secrets)
- AgentService.listAgents() uses portable getAllWorkflowDefs() (orkes oss-core lacks
  getAllWorkflowDefsLatestVersions)

Spike branch off server_split (PR #271). See orkes-conductor branch
agentspan-embed-spike: AGENTSPAN_EMBED_SPIKE_HANDOFF.md.
…ive secret syntax

- extract SkillMetadataDAO SPI; FileSystemSkillMetadataDAO default in conductor-agentspan-server
  (OSS on-disk layout unchanged; SkillRegistryService repointed to the SPI; test updated)
- AgentService pre-allocates the execution id via Conductor's IDGenerator, so hosts that derive
  createTime from the id (orkes time_based -> v1 UUID) get a real timestamp; fixes Conductor-UI
  visibility (archive returned 0 for random v4 ids)
- compiler emits host-native ${workflow.secrets.NAME} for HTTP/MCP headers when embedded
  (EmbeddedMode), else #{NAME} for the standalone credential-aware tasks
- un-gate CredentialAwareMcpService (conductor-ai 3.30.2 ctor matches; self-gates on __agentspan_ctx__)
- gate SecretController behind agentspan.embedded (host owns /api/secrets)
…orization, createdBy attribution, auto-config gating

- CLI: browser PKCE loopback login (default, discovers Auth0 from /context.js
  incl. real orkes quoted-key format), --device grant, --key-id/--key-secret
  orkes service-account grant; token in ~/.agentspan/token sent as
  X-Authorization with refresh/re-mint; --server persisted as session default;
  login prompts for instance URL on a TTY; credentials/login use getConfig()
- SDK: all HTTP paths (sync agent API+compile, sync SSE, async http_client,
  langchain/langgraph/claude_agent_sdk adapters) mint a JWT from
  AGENTSPAN_AUTH_KEY/SECRET via POST /token and send X-Authorization
  (shared cached helper _internal/token_utils.py); replaces the
  X-Auth-Key/X-Auth-Secret headers, which orkes does not recognize; DO_WHILE
  iteration fallback for guardrail/termination/stop_when workers
  (TaskContext.task.iteration)
- library: AgentSpanAutoConfiguration gated standalone-or-embedded (host gets
  stock conductor unless agentspan.embedded=true); AgentService stamps stock
  StartWorkflowRequest.createdBy from the RequestContext principal (root +
  sub-workflow attribution, task _createdBy stamping, poll impersonation);
  KnownProviderEnvVars shared env-seeding list (standalone seeder repointed)
…d system task

The previous fix minted execution tokens for webhook/UI/schedule-started
workflows inside AgentEventListener.onWorkflowStarted by mutating the
WorkflowModel input. That works on local Conductor but is not durable on
orkes-conductor, which does not persist a status-listener's workflow
mutations — so the minted token was lost before the first credentialed
task ran.

Replace it with AGENTSPAN_MINT_TOKEN, a WorkflowSystemTask the compiler
injects as the first task of every agent workflow (root and recursively
compiled sub-workflows, excluding framework-passthrough and graph-structure
defs). The task:
  - passes through an existing __agentspan_ctx__ (SDK /agent/start path, or
    a sub-workflow inheriting it from its parent's SUB_WORKFLOW input), else
  - mints a fresh token from the identity + declared-credential allow-list
    stamped on the WorkflowDef metadata at deploy time.

The resulting ctx is written to workflow.variables.__agentspan_ctx__ and
persisted via ExecutionDAOFacade.updateWorkflow — the same durable channel
SET_VARIABLE uses — so it survives every decide() reload on every core.
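
The pass-through-or-mint decision described above could be sketched roughly as follows. This is a hypothetical Python illustration, not the Java system task: `resolve_ctx`, `apply_mint_task`, the `identity`/`credentials` metadata keys, and the `mint` callback are all assumed names.

```python
from typing import Any, Callable, Dict, List

CTX_KEY = "__agentspan_ctx__"


def resolve_ctx(
    workflow_input: Dict[str, Any],
    def_metadata: Dict[str, Any],
    mint: Callable[[str, List[str]], Any],
) -> Any:
    """Return the execution context for this workflow run."""
    existing = workflow_input.get(CTX_KEY)
    if existing:
        # SDK /agent/start path, or inherited from the parent's SUB_WORKFLOW input.
        return existing
    # Webhook/UI/schedule start: mint from what was stamped at deploy time.
    # The metadata key names below are assumptions for illustration.
    return mint(def_metadata.get("identity", ""), def_metadata.get("credentials", []))


def apply_mint_task(workflow: Dict[str, Any], mint: Callable[[str, List[str]], Any]) -> Dict[str, Any]:
    """Write the ctx to workflow variables, the durable channel the message names."""
    ctx = resolve_ctx(workflow.get("input", {}), workflow.get("metadata", {}), mint)
    workflow.setdefault("variables", {})[CTX_KEY] = ctx
    return workflow
```

Downstream tasks would then read the value from the workflow variables rather than from the workflow input, which is what the switch to `${workflow.variables.__agentspan_ctx__}` reflects.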

Downstream tasks (sub-workflow inputs, LLM tasks, HTTP/credential-aware
tool fetches, enrich scripts) now read ${workflow.variables.__agentspan_ctx__}
instead of ${workflow.input.__agentspan_ctx__} across AgentCompiler,
MultiAgentCompiler, and ToolCompiler.

Remove the now-dead ensureExecutionToken path and its tests from
AgentEventListener, add AgentspanMintTokenTaskTest, and update the compiler
and E2E tests to account for the injected mint task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The local worker's WorkerCredentialFetcher resolves secrets via
POST /api/workers/secrets. It authenticated by sending the raw key id
(AGENTSPAN_AUTH_KEY) as `Authorization: Bearer`, which an orkes-conductor
host's API gateway rejects with INVALID_TOKEN ("Token cannot be null or
empty") — so every credentialed worker tool (e.g. the Slack delivery
worker) failed even when the execution token was present in the task input.

Mint an orkes session token from the key/secret pair (POST /api/token) and
send it as the `X-Authorization` header, caching it and refreshing once on
a 401. When no key/secret pair is configured (standalone agentspan), fall
back to the existing `Authorization: Bearer` api-key behavior, where the
execution token in the request body is the only credential required.
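
A minimal sketch of the mint, cache, and refresh-once-on-401 flow, assuming an injected `post` transport so it runs without a network. The class name, the `keyId`/`keySecret` request fields, the `token` response field, and the `api_key` fallback parameter are assumptions for illustration, not the SDK's actual code.

```python
from typing import Callable, Dict, Optional, Tuple

# (status_code, json_body) as returned by the injected transport (assumed shape).
Response = Tuple[int, dict]
Post = Callable[[str, Dict[str, str], dict], Response]


class SessionTokenAuth:
    def __init__(self, key_id: Optional[str], key_secret: Optional[str],
                 post: Post, api_key: str = ""):
        self._key_id = key_id
        self._key_secret = key_secret
        self._post = post
        self._api_key = api_key
        self._token: Optional[str] = None  # cached session token

    @property
    def _has_pair(self) -> bool:
        return bool(self._key_id and self._key_secret)

    def _mint(self) -> str:
        status, body = self._post(
            "/api/token", {}, {"keyId": self._key_id, "keySecret": self._key_secret}
        )
        if status != 200:
            raise RuntimeError(f"token mint failed: HTTP {status}")
        self._token = body["token"]
        return self._token

    def _headers(self) -> Dict[str, str]:
        if self._has_pair:
            return {"X-Authorization": self._token or self._mint()}
        # Standalone fallback: no key/secret pair configured.
        return {"Authorization": f"Bearer {self._api_key}"}

    def request(self, path: str, payload: dict) -> Response:
        status, body = self._post(path, self._headers(), payload)
        if status == 401 and self._has_pair:
            self._token = None  # drop the stale token, then retry exactly once
            status, body = self._post(path, self._headers(), payload)
        return status, body
```

Retrying only once keeps a genuinely invalid key/secret pair from looping: a second 401 is returned to the caller as-is.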

Wire auth_key/auth_secret from AgentConfig into the fetcher in _dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.github/workflows/ci.yml
#	.github/workflows/release-server-maven.yml
#	sdk/java/docs/api-reference.md
#	sdk/java/docs/concepts/agents.md
#	sdk/java/docs/spring-boot.md
#	sdk/java/e2e/Suite12HandoffApprove.java
#	sdk/java/spring/src/main/java/org/conductoross/conductor/ai/spring/AgentAutoConfiguration.java
#	sdk/java/src/main/java/org/conductoross/conductor/ai/Agent.java
#	sdk/java/src/main/java/org/conductoross/conductor/ai/internal/AgentConfigSerializer.java
#	sdk/java/src/main/java/org/conductoross/conductor/ai/model/AgentEvent.java
#	server/build.gradle
#	server/conductor-agentspan-server/src/main/java/dev/agentspan/runtime/service/skill/FileSystemSkillMetadataDAO.java
#	server/conductor-agentspan-server/src/test/java/dev/agentspan/runtime/service/AgentHumanTaskTest.java
#	server/conductor-agentspan/src/main/java/dev/agentspan/runtime/controller/CredentialMaskingResponseAdvice.java
#	server/conductor-agentspan/src/main/java/dev/agentspan/runtime/service/AgentService.java
…{workflow.secrets.NAME}; rip out standalone credential machinery

When embedded in a host (orkes-conductor), AgentSpan now resolves NO
credentials itself. Every credential need — LLM provider keys, MCP/HTTP
headers, and SDK worker-tool secrets — is expressed as a Conductor-native
${workflow.secrets.NAME} reference physically present in task input, which
the host resolves just-in-time (never persisting plaintext) exactly as it
does for any other workflow. Standalone (OSS) is deliberately non-secure:
no credential store, no secrets API, no in-process resolution — worker tools
simply receive no secrets. Security exists only inside a host that owns a
secret store.

Design: docs/design/2026-06-25-worker-secret-delivery-design.md

WHY (history): this branch went through four prior iterations —
(1) status-listener execution-token mint (orkes does not persist a
status-listener's workflow mutations, so the token was lost before the first
task ran — the original "webhook execution token" bug), (2) an
AGENTSPAN_MINT_TOKEN system task injected as the first node (visible
distracting node; still a bespoke capability token), (3) a pull-based
/api/workers/secrets endpoint + an SDK fetcher in four languages, and
(4) WorkerSecretPollAdvice poll-time injection (works only when AgentSpan owns
the poll path; in a real orkes-embedded deployment the HOST serves the poll, so
the advice never runs and the SDK worker path got no secrets). This commit is
iteration 5: stamp ${workflow.secrets.NAME} references at compile/schedule time
and let the host substitute them natively. All four prior mechanisms are removed.

REFERENCE-STAMPING SITES (all at compile/schedule time, embedded-gated):
- LLM provider keys (AgentChatCompleteTaskMapper + new LlmProviderEnv): when
  embedded, stamp apiKey = "${workflow.secrets.<PROVIDER_KEY>}" into the
  LLM task input (provider -> env-var map, e.g. openai -> OPENAI_API_KEY).
  AgentspanAIModelProvider.getModel() reads the host-resolved apiKey from task
  input and builds a fresh model with it. Only the REQUIRED key is stamped;
  base_url and Gemini project id are not auto-stamped (orkes hard-fails on a
  missing secret reference and those are optional). Standalone stamps nothing
  and falls back to System.getenv / startup-configured model.
- MCP / HTTP headers (ToolCompiler.rewriteCredentialPlaceholders): embedded
  rewrites the SDK's ${NAME} header placeholders to ${workflow.secrets.NAME};
  these run as in-process system tasks so the host substitutes before the call.
- SDK worker-tool secrets (enrich script + static sites): each SIMPLE
  worker-tool task gets inputParameters.__resolved_credentials__ =
  { NAME: "${workflow.secrets.NAME}" }. Dynamic path —
  ToolCompiler.buildEnrichTask / buildEnrichTaskDynamic build a workerCredCfg
  map (gated on EmbeddedMode.isEmbedded()), serialize it into the GraalJS enrich
  script, and the SIMPLE branch injects it. Static paths —
  AgentCompiler.compileFrameworkPassthrough and compilePrefillTasks stamp the
  map directly. Per-tool names come from AgentCompiler.collectToolCredentials
  (tool's own credentials with agent-level fallback), plumbed into ToolCompiler
  via setWorkerCreds to preserve the fallback. HTTP/MCP tools excluded (their
  secrets travel as headers). The host resolves __resolved_credentials__ at poll
  time via prepareTaskWithSecrets (recursive Map/List secret substitution).
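
The two stamping shapes above (header placeholder rewrite and the per-tool `__resolved_credentials__` map) can be illustrated with a short Python sketch. The function names and the identifier pattern are assumptions; the real logic lives in the Java ToolCompiler.

```python
import re
from typing import Dict, Iterable

# Matches ${NAME} only. Dotted references such as ${workflow.secrets.NAME}
# or ${workflow.input.x} do not match, so the rewrite is idempotent.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def rewrite_credential_placeholders(headers: Dict[str, str], embedded: bool) -> Dict[str, str]:
    """Rewrite ${NAME} header placeholders to host-native secret references."""
    if not embedded:
        return dict(headers)  # standalone: nothing is stamped
    return {
        key: _PLACEHOLDER.sub(r"${workflow.secrets.\1}", value)
        for key, value in headers.items()
    }


def worker_credential_refs(names: Iterable[str]) -> Dict[str, str]:
    """Build the { NAME: "${workflow.secrets.NAME}" } map for a worker tool."""
    return {name: f"${{workflow.secrets.{name}}}" for name in names}
```

For example, `rewrite_credential_placeholders({"Authorization": "Bearer ${GITHUB_TOKEN}"}, embedded=True)` yields `{"Authorization": "Bearer ${workflow.secrets.GITHUB_TOKEN}"}`, which the host then substitutes.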

SDK CONTRACT (unchanged across Python/TypeScript/Java/C#): the worker reads
__resolved_credentials__ from task input, injects the values into the env /
credential context, and strips the key before invoking the tool. The bespoke
WorkerCredentialFetcher / credential fetcher modules are deleted in all SDKs
since secrets now arrive inline in task input.
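
On the worker side, the contract could look roughly like this. It is a sketch under assumptions: `split_resolved_credentials` and `invoke_tool` are hypothetical names, and passing the secrets as the tool's first argument stands in for whatever env or credential context each SDK actually uses.

```python
from typing import Any, Callable, Dict, Tuple

RESOLVED_KEY = "__resolved_credentials__"


def split_resolved_credentials(task_input: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Separate tool arguments from host-resolved secrets."""
    # Strip the key so the tool never sees it as an ordinary argument.
    tool_args = {k: v for k, v in task_input.items() if k != RESOLVED_KEY}
    raw = task_input.get(RESOLVED_KEY) or {}
    # Keep only values that were actually resolved; an unresolved reference
    # may arrive as null when the host has no secret store.
    secrets = {name: value for name, value in raw.items() if isinstance(value, str) and value}
    return tool_args, secrets


def invoke_tool(task_input: Dict[str, Any], tool: Callable[..., Any]) -> Any:
    """Run a tool with its secrets supplied separately from its arguments."""
    tool_args, secrets = split_resolved_credentials(task_input)
    return tool(secrets, **tool_args)
```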

REMOVED (standalone machinery — "rip out entirely"):
- server controllers: SecretController (/api/secrets CRUD), WorkerController
  (/api/workers/secrets).
- credentials services/tasks: CredentialResolutionService, CredentialAwareHttpTask
  (+Config), CredentialAwareMcpService (upstream MCPService @Component takes over,
  receiving already-resolved headers), AgentspanMintTokenTask, ExecutionTokenService,
  KnownProviderEnvVars.
- store/encryption: EncryptedDbCredentialStoreProvider, MasterKeyConfig,
  CredentialEnvSeeder, CredentialSchemaMigrator, CredentialDataSourceConfig,
  CredentialStoreProvider SPI, schema-credentials(.|-postgres).sql.
- models: CredentialMeta, ResolveRequest, ResolveResponse.
- SDK fetchers: Java WorkerCredentialFetcher, Python credentials/fetcher.py.
- all associated tests for the deleted classes.

NOTE: includes HANDOFF-worker-secret-embedded.txt planning notes (scratch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lity

Sharpen the design doc to reflect the actual deployment model. The prior text
framed it as a binary "embedded (orkes-conductor) gets secrets" vs "standalone
(OSS) does not", which conflated two different things.

Corrections:
- Secret resolution is Orkes-Conductor's EXISTING mechanism
  (TaskResource.prepareTaskWithSecrets, OrkesWorkflowExecutor.substituteSecret,
  PostgresSecretsDAO). AgentSpan contributes NOTHING custom — it only emits
  Conductor-native ${workflow.secrets.NAME} references and relies entirely on
  the host to substitute them.
- That mechanism does NOT exist in OSS Conductor. The EmbeddedMode.isEmbedded()
  gate stamps references whenever agentspan.embedded=true, but only
  Orkes-Conductor resolves them. So there are three deployments, not two:
  Orkes-embedded -> secrets work; OSS-Conductor-embedded -> references stamped
  but never substituted, no secrets; standalone -> references not even stamped,
  no secrets. Both non-Orkes modes are non-secure by design.
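
The three deployments can be written down as a small truth table. The sketch below is only an illustration of the matrix described here; `SecretBehavior` and `secret_behavior` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretBehavior:
    references_stamped: bool   # compiler emitted ${workflow.secrets.NAME}
    references_resolved: bool  # host substituted real values


def secret_behavior(embedded: bool, host_is_orkes: bool) -> SecretBehavior:
    """Model the deployment matrix: only Orkes-embedded delivers secrets."""
    stamped = embedded                     # the agentspan.embedded=true gate
    resolved = embedded and host_is_orkes  # only Orkes-Conductor substitutes
    return SecretBehavior(stamped, resolved)


ORKES_EMBEDDED = secret_behavior(embedded=True, host_is_orkes=True)
OSS_EMBEDDED = secret_behavior(embedded=True, host_is_orkes=False)
STANDALONE = secret_behavior(embedded=False, host_is_orkes=False)
```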

Updated TL;DR (added a deployment matrix), the guiding principle, §2 lead-in,
§6 security model, and §8 rationale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite from scratch: tighter summary + deployment matrix up front, condensed
the Orkes-resolution and compiler-stamping sections, collapsed the iteration
history into a short list. Same content and conclusions, far less prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lt in performance gain from one fewer JS task.

Fix spotless checks

Add null-default handling for compiled child-agent tools (the missing defaults were causing errors in the UI)
…ials` CLI

The credential redesign (commit fa64a9c) deleted the standalone secret store
and the server's /api/secrets endpoint; secrets are now delegated entirely to
the Orkes host via ${workflow.secrets.NAME}. The e2e environment (and any
standalone deploy) therefore has NO secret backend — no way to write a secret
and have a running tool receive its value. The e2e suites that did exactly that
(via the CLI `credentials` command, or Java PUT /api/secrets) failed with
HTTP 404. This commit aligns the tests and CLI with the new architecture.

Decision: trim (keep credential-independent coverage), and remove the now-dead
CLI command entirely.

DEAD CLI REMOVAL (cli/):
- Deleted cli/cmd/credentials.go (+ test), cli/tui/views/credentials.go, and the
  stale 05_credentials TUI snapshot. The `agentspan credentials` subcommand no
  longer exists.
- cli/tui/app.go + nav.go: removed the Credentials view field/init, its message/
  key/render cases, the ViewCredentials nav item and iota; renumbered the sidebar
  and number shortcuts (Doctor/Configure/Skills shifted up). Snapshots regenerated.
- cli/client/client.go: removed CredentialMeta + ListCredentials/SetCredential/
  DeleteCredential. cli/client/client_test.go + tui tests updated accordingly.
- cli/examples/github.go left as-is (only a cosmetic tag label, no /api/secrets).
- Verified: go build ./..., go vet ./..., go test ./... all green.

E2E TRIMS (keep coverage, drop secret-injection):
- Python (sdk/python/e2e): removed CredentialsCLI + cli_credentials fixture from
  conftest. suite2 rewritten to two tests — a no-credential tool COMPLETES, and a
  credential-requiring tool with no backend FAILS terminally without leaking env
  vars (the "env is not a silent fallback" guarantee kept). suite4/suite5 keep
  Phase-1 (unauthenticated discovery + execution); Phase-2 auth removed.
  suite3 cli_credential_lifecycle replaced with the credential-free command
  whitelist coverage. suite5 external-OpenAPI test unchanged.
- TypeScript (sdk/typescript/tests/e2e): removed credentialSet/credentialDelete +
  CLI_PATH from helpers. Same trims as Python across suite2/3/4/5.
- Java (sdk/java/e2e): Suite2ToolCallingCredentials rewritten — dropped putSecret/
  deleteSecret helpers and the set/update steps; kept a no-credential-completes
  test and a credential-required-fails-without-backend test. Suite4McpTools
  unchanged (its credential tests are declaration/serialization only).

All retained assertions are deterministic status/output checks (no LLM
validation, per project rule). Builds/lints/compiles verified green per language;
the live e2e suites are validated by CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…flaky Python checks

Follow-up to the credential e2e trim. The first pass left two problems that CI
surfaced, and missed the C# SDK entirely:

Python (test_suite2_tool_calling.py):
- paid_tool_a read os.environ directly, so the "env is not a silent fallback"
  check defeated itself — setting the env var made the tool SUCCEED ('paid_a:fro'
  from 'from-env-aaa'). Switched it to the SDK accessor get_secret(), which reads
  the injected credential context and never consults os.environ. Now the trimmed
  secret correctly raises CredentialNotFoundError → task fails, and the env var
  cannot leak.
- The free-tool test asserted on the agent's free-form text (extraction returned
  "result"); switched to asserting the deterministic tool TASK output.
- Renamed the failure test to test_credential_required_tool_is_trimmed_and_fails;
  it now asserts on the tool task rather than the run status, checks that the
  success marker 'paid_a:' is absent (the secret was genuinely not delivered),
  and accepts 'FAILED'.

C# (Suite2_ToolCalling.cs) — was never trimmed:
- Removed the set/update store steps (PutSecretAsync/DeleteSecretAsync) and the
  env-leak step (its tools read env directly, so setting env made them succeed).
- CredentialLifecycle_RuntimeInjection -> CredentialTrimmed_RequiredToolFails:
  free tool COMPLETES; paid tools FAIL with the secret trimmed and never emit
  their success marker. AssertToolTaskTerminal -> AssertToolTaskFailed, which
  accepts 'FAILED' alongside the terminal variants.

Design doc: documented what OSS Conductor actually does with the reference
(ParametersUtils resolves ${workflow.secrets.X} to null via SUPPRESS_EXCEPTIONS —
no secrets node), framed it as the intentional "trim", and added the e2e
contract (validate the expected failure; never inject values).

Java and TypeScript suite2 already passed (their tools use accessors that ignore
env) and are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-name collision)

test_no_credential_tool_completes matched the wrong task: the agent was named
"e2e_free_tool", and _find_tool_tasks_for matches tool tasks by substring on the
reference name, so "free_tool" matched the orchestration task
"e2e_free_tool_ctx_resolve" (output {'result': {}}) instead of the real free_tool
worker task. Renamed the agents to e2e_nocred_agent / e2e_reqcred_agent so no
agent name contains a tool name as a substring. The paid agent never collided
("paid_tool_a" is not a substring of "e2e_paid_tool_*"), which is why
test_credential_required_tool_is_trimmed_and_fails already passed.
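
The collision is easy to reproduce with a stand-in for the substring matcher. `find_tool_tasks` below is a simplified, hypothetical version of `_find_tool_tasks_for`, and the worker reference name `free_tool_1` is made up for illustration.

```python
from typing import Dict, List


def find_tool_tasks(tasks: List[Dict[str, str]], tool_name: str) -> List[Dict[str, str]]:
    """Match tasks whose reference name contains the tool name."""
    return [t for t in tasks if tool_name in t["referenceTaskName"]]


# Agent named "e2e_free_tool": its orchestration task also contains "free_tool".
before = [
    {"referenceTaskName": "e2e_free_tool_ctx_resolve"},
    {"referenceTaskName": "free_tool_1"},
]
# Agent renamed to "e2e_nocred_agent": only the worker task matches.
after = [
    {"referenceTaskName": "e2e_nocred_agent_ctx_resolve"},
    {"referenceTaskName": "free_tool_1"},
]

collided = find_tool_tasks(before, "free_tool")  # two hits, wrong one first
clean = find_tool_tasks(after, "free_tool")      # one hit, the worker task
```

The paid case never collided because the longer tool name `paid_tool_a` does not occur inside `e2e_paid_tool_ctx_resolve`.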

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…short=true metadata list)

The deploy/compile path stamped `WorkflowDef.timeoutPolicy = null` in two places
(`createWorkflow` and `applyTimeout`'s no-timeout branch). A null persists fine,
but the orkes `?short=true` metadata summary builder enum-parses the field
(`WorkflowDef.TimeoutPolicy.valueOf(...)`), so a null def 500s the entire
workflow/agent LIST page with "BACKEND_ERROR - Name is null" (the exact message
java.lang.Enum.valueOf throws on a null name).

This was deploy-path-only and invisible to a GET→PUT round-trip: conductor's
WorkflowDef defaults `timeoutPolicy` to ALERT_ONLY, so re-registering a fetched
def re-applies the non-null default and "fixes" the list — which is why the null
couldn't be found in any serialized def and why re-deploying via `--deploy`
reintroduced the break.

Fix: set `WorkflowDef.TimeoutPolicy.ALERT_ONLY` (conductor's own default) instead
of null at both sites. ALERT_ONLY alerts but does not terminate the workflow, so
this is behavior-preserving — the previous null meant no enforcement, and with
timeoutSeconds=0 the policy is moot anyway; the change exists purely so orkes can
enum-parse a non-null value.
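
As a rough Python analogue of the failure and the fix: a strict by-name enum lookup rejects a null, and defaulting to ALERT_ONLY avoids it. Here `TimeoutPolicy[name]` stands in for Java's `TimeoutPolicy.valueOf(name)` (it raises `KeyError` rather than a NullPointerException); `summarize`, `apply_timeout`, and the choice of TIME_OUT_WF when a timeout is set are assumptions for illustration.

```python
from enum import Enum
from typing import Any, Dict, Optional


class TimeoutPolicy(Enum):
    TIME_OUT_WF = "TIME_OUT_WF"
    ALERT_ONLY = "ALERT_ONLY"


def summarize(workflow_def: Dict[str, Any]) -> Dict[str, Any]:
    """Strictly parse the policy by name, like the ?short=true summary builder."""
    policy = TimeoutPolicy[workflow_def["timeoutPolicy"]]  # fails on None
    return {"name": workflow_def["name"], "timeoutPolicy": policy}


def apply_timeout(workflow_def: Dict[str, Any], timeout_seconds: Optional[int]) -> Dict[str, Any]:
    """Never leave the policy null, even when no timeout is enforced."""
    if timeout_seconds and timeout_seconds > 0:
        workflow_def["timeoutSeconds"] = timeout_seconds
        workflow_def["timeoutPolicy"] = TimeoutPolicy.TIME_OUT_WF.name
    else:
        workflow_def["timeoutSeconds"] = 0
        # Previously None; ALERT_ONLY is the default the message cites.
        workflow_def["timeoutPolicy"] = TimeoutPolicy.ALERT_ONLY.name
    return workflow_def
```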

Test: `AgentCompilerTest.testTimeoutPolicyNeverNull` asserts a compiled def's
timeoutPolicy is non-null (ALERT_ONLY). Validated by reintroducing the null at
both sites and confirming the test fails first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…utPolicy

testMultiAgentNoTimeoutSetsNoPolicy asserted timeoutPolicy == null, which
encoded the bug just fixed (a null policy 500s orkes' ?short=true metadata list
via TimeoutPolicy.valueOf(null)). Renamed to testMultiAgentNoTimeoutKeepsNonNullPolicy
and updated it to assert the fixed contract: no timeout is enforced
(timeoutSeconds == 0) but the policy stays non-null (ALERT_ONLY).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
