Let agent signal step failure via a tagged block #2045
A Claude CLI run always exits 0, so a user-requested failure condition
("fail the deployment if the health check is red") was undetectable from
the outside. Add an octopus-fail-deployment skill that has the agent emit
an <octopus-task-failed> block, and have ClaudeAgentOutcomeEvaluator scan
the result for it and fail the step with the captured reason.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
  catch (Exception installException)
  {
-     Console.Error.WriteLine("Running rollback behaviours...");
+     log.Verbose("Running rollback behaviours...");
This was annoying as hell. There is no rollback behaviour taking place, so why make it so prominent?
Pull request overview
This PR adds a deterministic mechanism for the AI agent to intentionally fail an Octopus step (despite the Claude CLI exiting 0) by emitting a tagged <octopus-task-failed>...</octopus-task-failed> block, which is then detected and converted into a managed-code failure.
Changes:
- Added a new agent skill (`octopus-fail-deployment`) that defines the contract for signalling an intentional failure via a tagged block.
- Updated `ClaudeAgentOutcomeEvaluator` to scan agent output for the failure tag and throw a `CommandException` with the captured reason.
- Extended `ClaudeAgentOutcomeEvaluator` unit tests to cover the new failure signal behavior.
- Switched rollback messaging in `PipelineCommand` from direct `Console.Error` writes to `ILog`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| source/Calamari.Common/Plumbing/Pipeline/PipelineCommand.cs | Routes rollback diagnostics through ILog instead of writing directly to stderr. |
| source/Calamari.AiAgent/ClaudeCodeBehaviour/DefaultContext/Skills/octopus-fail-deployment.md | Defines the agent-facing failure-signal contract and formatting rules. |
| source/Calamari.AiAgent/ClaudeCodeBehaviour/ClaudeAgentOutcomeEvaluator.cs | Detects the <octopus-task-failed> block and fails the step with a clear reason. |
| source/Calamari.AiAgent.Tests/ClaudeCodeBehaviour/ClaudeAgentOutcomeEvaluatorFixture.cs | Adds unit coverage for the new intentional-failure signal parsing/precedence. |
    }
    catch (Exception rollbackException)
    {
        Console.Error.WriteLine(rollbackException);
We should generally not be writing to Console in Calamari as it makes testing more difficult
Align the octopus-fail-deployment skill spec and code comment with the matcher, which accepts a self-closing <octopus-task-failed/> as a reason-less failure.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
eddymoulton left a comment
Couple of nits, but looks good
## Rules

- Emit the block **only** when the user expressed a failure condition AND you have determined it is met. If the condition was not met, say nothing special and let the step succeed.
- Always write a **complete** block: either a paired block ending in `</octopus-task-failed>` or a self-closing `<octopus-task-failed/>`. A closed tag is how Octopus confirms the message is whole. If you open the block but stop before closing it, the failure will not be detected, so finish the block before ending your turn.
RE: self closing block
I would be tempted to remove that as an option. Makes this whole thing more consistent and the matching regex simpler.
I'm a bit concerned about removing flexibility that Claude might like to take advantage of however.
I'll leave it with you to decide if you think cutting that down makes things simpler enough to warrant the change.
Interesting you say that. I originally had it not allow self-closing tags, but then Claude pointed out it might make it more likely to use it when there happens to be no specific reason to apply.
I'll leave it in for now, but if it creates any false positives/negatives we can reconsider.
> Claude pointed out it might make it more likely to use it when there happens to be no specific reason to apply
That's reason enough if Claude thinks it might be a problem without it.
…Evaluator.cs
Co-authored-by: Eddy Moulton <8491021+eddymoulton@users.noreply.github.com>
Background
The AI agent step runs the Claude CLI, which always exits `0`, even when the agent semantically failed at the task. That left a gap: when a workflow author writes "fail the deployment if the smoke test doesn't pass", the agent could detect the condition but had no way to make Octopus mark the step as failed. `ClaudeAgentOutcomeEvaluator` only inspected the process exit code and the CLI's structured result (`is_error`, `subtype`, permission denials), none of which capture an intentional failure.
Results
Adds a deterministic agent→managed-code failure signal:
- `octopus-fail-deployment` skill: instructs the agent, when the user expressed a failure condition that's been met, to emit an `<octopus-task-failed>…</octopus-task-failed>` block with an operator-facing reason. Absence of the block = success (unchanged default).
- `ClaudeAgentOutcomeEvaluator`: scans the final result text for the block and throws a `CommandException` with the captured reason (generic fallback when empty), checked before the generic CLI-status checks so an intentional failure surfaces a clear message.

Design notes:
- No Server change required; the failure propagates through the existing non-zero-exit path.
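For readers who want a feel for the matching behaviour described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class name `FailureSignalSketch`, the method `TryGetFailureReason`, the exact regex, and the fallback message are all assumptions made for illustration. The real evaluator throws a `CommandException`; this sketch only reports whether a complete block was found and what reason it carried.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: names and pattern are hypothetical, not taken from the PR.
public static class FailureSignalSketch
{
    // Matches a self-closing <octopus-task-failed/> or a paired block.
    // Singleline lets '.' cross newlines so multi-line reasons are captured.
    static readonly Regex FailureBlock = new Regex(
        @"<octopus-task-failed\s*/>|<octopus-task-failed>(?<reason>.*?)</octopus-task-failed>",
        RegexOptions.Singleline);

    // Assumed wording for the generic fallback used when no reason is given.
    public const string FallbackReason = "The agent signalled a failure without giving a reason.";

    // Returns true only when a complete (closed) block is present.
    public static bool TryGetFailureReason(string resultText, out string reason)
    {
        reason = string.Empty;
        var match = FailureBlock.Match(resultText ?? string.Empty);
        if (!match.Success)
            return false; // no block, or an unclosed (truncated) block

        // For the self-closing form the "reason" group did not participate,
        // so its Value is the empty string and the fallback applies.
        var captured = match.Groups["reason"].Value.Trim();
        reason = captured.Length > 0 ? captured : FallbackReason;
        return true;
    }

    public static void Main()
    {
        var samples = new[]
        {
            "Done.\n<octopus-task-failed>Health check is red</octopus-task-failed>",
            "<octopus-task-failed/>",
            "<octopus-task-failed>Truncated before the closing tag",
            "Deployment looks healthy."
        };

        foreach (var sample in samples)
        {
            var failed = TryGetFailureReason(sample, out var reason);
            Console.WriteLine(failed ? $"FAILED: {reason}" : "OK");
        }
    }
}
```

In this sketch an unclosed block falls through to the success path, mirroring the rule that only a closed tag counts as a whole message. A caller that wanted the PR's behaviour would presumably run this check before the other CLI-status checks and throw when it returns `true`.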
Testing
`ClaudeAgentOutcomeEvaluatorFixture`: 14 unit tests passing, including reason capture, multi-line reasons, empty/self-closing blocks, a block embedded in larger output, precedence over a non-success subtype, and that an unclosed (truncated) block does not fail the step.
How to review
Core logic is the regex + check in `ClaudeAgentOutcomeEvaluator.cs`; the skill markdown is the agent-facing contract. The rest is test coverage.
🤖 Generated with Claude Code
Resolves: #MD-2151