Let agent signal step failure via a tagged block #2045

Merged: zentron merged 6 commits into main from robert/agent-fail-deployment-signal on Jun 30, 2026
zentron commented Jun 26, 2026

Contributor

Background

The AI agent step runs the Claude CLI, which always exits 0 — even when the agent semantically failed at the task. That left a gap: when a workflow author writes "fail the deployment if the smoke test doesn't pass", the agent could detect the condition but had no way to make Octopus mark the step as failed. ClaudeAgentOutcomeEvaluator only inspected the process exit code and the CLI's structured result (is_error, subtype, permission denials), none of which capture an intentional failure.

Results

Adds a deterministic agent→managed-code failure signal:

  • New octopus-fail-deployment skill — instructs the agent, when the user expressed a failure condition that's been met, to emit an <octopus-task-failed>…</octopus-task-failed> block with an operator-facing reason. Absence of the block = success (unchanged default).
  • ClaudeAgentOutcomeEvaluator — scans the final result text for the block and throws a CommandException with the captured reason (generic fallback when empty), checked before the generic CLI-status checks so an intentional failure surfaces a clear message.

Design notes:

  • Uses a stdout tag rather than a marker file, so it works even when the agent is sandboxed without write permissions.
  • The required closing tag doubles as a completeness guard — a truncated, unclosed block won't match, so it can't be mistaken for a deliberate failure.
  • Tag wording is task-neutral (not deployment-specific) so it reads correctly for runbooks too.

No Server change required — the failure propagates through the existing non-zero-exit path.
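
For reviewers who want a concrete picture of the check described above, here is a minimal sketch of how such a matcher could look. It is not the code merged in this PR: the class name `FailureSignalSketch`, the method `TryGetFailureReason`, the exact regex pattern, and the fallback message are all assumptions made for illustration. The real evaluator throws a `CommandException` rather than returning a flag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the tag check; names and pattern are illustrative only.
public static class FailureSignalSketch
{
    // Assumed pattern: either a self-closing tag, or a paired block whose body is
    // captured lazily. Singleline lets '.' cross newlines for multi-line reasons.
    // An opened but unclosed block matches neither alternative.
    private static readonly Regex FailureBlock = new Regex(
        @"<octopus-task-failed\s*/>|<octopus-task-failed>(?<reason>.*?)</octopus-task-failed>",
        RegexOptions.Singleline);

    // Assumed fallback wording; the actual message is not shown in this conversation.
    private const string GenericReason = "The agent signalled that the task failed.";

    // Returns true when a complete block is present. 'reason' is the trimmed block
    // body, or the generic fallback when the block is empty or self-closing.
    public static bool TryGetFailureReason(string resultText, out string reason)
    {
        reason = string.Empty;
        if (string.IsNullOrEmpty(resultText))
            return false;

        Match match = FailureBlock.Match(resultText);
        if (!match.Success)
            return false;

        string captured = match.Groups["reason"].Value.Trim();
        reason = captured.Length == 0 ? GenericReason : captured;
        return true;
    }
}
```

A caller would fail the step when this returns true (in this PR, by throwing a `CommandException` with the reason) and treat a false return as "no intentional failure", which also covers the truncated, unclosed case.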

Testing

ClaudeAgentOutcomeEvaluatorFixture — 14 unit tests passing, including reason capture, multi-line reasons, empty/self-closing blocks, a block embedded in larger output, precedence over a non-success subtype, and that an unclosed (truncated) block does not fail the step.

How to review

Core logic is the regex + check in ClaudeAgentOutcomeEvaluator.cs; the skill markdown is the agent-facing contract. The rest is test coverage.

🤖 Generated with Claude Code

Resolves: #MD-2151

zentron and others added 2 commits June 26, 2026 17:33
A Claude CLI run always exits 0, so a user-requested failure condition
("fail the deployment if the health check is red") was undetectable from
the outside. Add an octopus-fail-deployment skill that has the agent emit
an <octopus-task-failed> block, and have ClaudeAgentOutcomeEvaluator scan
the result for it and fail the step with the captured reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
 catch (Exception installException)
 {
-    Console.Error.WriteLine("Running rollback behaviours...");
+    log.Verbose("Running rollback behaviours...");

Contributor Author

This was annoying as hell. There is no rollback behaviour taking place, so why make it so prominent?

Copilot AI left a comment


Pull request overview

This PR adds a deterministic mechanism for the AI agent to intentionally fail an Octopus step (despite the Claude CLI exiting 0) by emitting a tagged <octopus-task-failed>...</octopus-task-failed> block, which is then detected and converted into a managed-code failure.

Changes:

  • Added a new agent skill (octopus-fail-deployment) that defines the contract for signalling an intentional failure via a tagged block.
  • Updated ClaudeAgentOutcomeEvaluator to scan agent output for the failure tag and throw a CommandException with the captured reason.
  • Extended ClaudeAgentOutcomeEvaluator unit tests to cover the new failure signal behavior.
  • Switched rollback messaging in PipelineCommand from direct Console.Error writes to ILog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| source/Calamari.Common/Plumbing/Pipeline/PipelineCommand.cs | Routes rollback diagnostics through ILog instead of writing directly to stderr. |
| source/Calamari.AiAgent/ClaudeCodeBehaviour/DefaultContext/Skills/octopus-fail-deployment.md | Defines the agent-facing failure-signal contract and formatting rules. |
| source/Calamari.AiAgent/ClaudeCodeBehaviour/ClaudeAgentOutcomeEvaluator.cs | Detects the <octopus-task-failed> block and fails the step with a clear reason. |
| source/Calamari.AiAgent.Tests/ClaudeCodeBehaviour/ClaudeAgentOutcomeEvaluatorFixture.cs | Adds unit coverage for the new intentional-failure signal parsing/precedence. |

Comment thread source/Calamari.Common/Plumbing/Pipeline/PipelineCommand.cs
Comment thread source/Calamari.Common/Plumbing/Pipeline/PipelineCommand.cs
}
catch (Exception rollbackException)
{
    Console.Error.WriteLine(rollbackException);

Contributor Author

We should generally not be writing to Console in Calamari, as it makes testing more difficult.

Align the octopus-fail-deployment skill spec and code comment with the matcher, which accepts a self-closing <octopus-task-failed/> as a reason-less failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

eddymoulton left a comment

Contributor

Couple of nits, but looks good

Comment thread source/Calamari.AiAgent/ClaudeCodeBehaviour/ClaudeAgentOutcomeEvaluator.cs Outdated
## Rules

- Emit the block **only** when the user expressed a failure condition AND you have determined it is met. If the condition was not met, say nothing special and let the step succeed.
- Always write a **complete** block — either a paired block ending in `</octopus-task-failed>` or a self-closing `<octopus-task-failed/>`. A closed tag is how Octopus confirms the message is whole — if you open the block but stop before closing it, the failure will not be detected, so finish the block before ending your turn.

Contributor

Re: self-closing block

I would be tempted to remove that as an option. It makes this whole thing more consistent and the matching regex simpler.

I'm a bit concerned about removing flexibility that Claude might like to take advantage of, however.
I'll leave it with you to decide if you think cutting that down makes things simpler enough to warrant the change.

Contributor Author

Interesting you say that. I originally had it not allow self-closing tags, but then Claude pointed out it might make it more likely to use it when there happens to be no specific reason to apply.
I'll leave it in for now, but if it creates any false positives/negatives we can reconsider.

Contributor

> Claude pointed out it might make it more likely to use it when there happens to be no specific reason to apply

That's reason enough if Claude thinks it might be a problem without it.

…Evaluator.cs

Co-authored-by: Eddy Moulton <8491021+eddymoulton@users.noreply.github.com>
zentron enabled auto-merge (squash) June 30, 2026 05:46
zentron merged commit 1bf6fcf into main Jun 30, 2026
35 checks passed
zentron deleted the robert/agent-fail-deployment-signal branch June 30, 2026 06:34

3 participants