Skip to content

bug(eval): At least one --evaluator or --evaluator-arn is required [?25h returns "Failed to parse event body" for Strands agents — CLI sends raw CloudWatch spans without conversation log records #1006

@john-kajabi

Description

@john-kajabi

Summary

agentcore run eval consistently fails with "Failed to parse event body for span with ID: <span_id>" when evaluating Strands-based agents. After tracing through the bedrock-agentcore SDK source code, I believe this is a mismatch between what the CLI sends to the evaluate API and what the backend evaluator expects.

Environment

  • agentcore CLI version: latest (Node.js, @aws/agentcore)
  • bedrock-agentcore SDK version: 1.7.0
  • strands-agents version: >=0.1.0 (1.37.0 installed)
  • Agent framework: Strands (Python)
  • AWS region: us-east-1

Steps to Reproduce

  1. Deploy a Strands-based agent to AgentCore Runtime
  2. Invoke the agent once to produce a session with spans in CloudWatch (aws/spans)
  3. Run agentcore run eval --evaluator "Builtin.Correctness" --trace-id <trace_id>

Observed Behavior

{
  "evaluator": "Builtin.Correctness",
  "aggregateScore": 0,
  "sessionScores": [
    {
      "sessionId": "a3498176-816d-462b-a038-8ad4e7236a50",
      "traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
      "value": 0,
      "errorMessage": "Failed to parse event body for span with ID: 040303ac6912ea97"
    }
  ]
}

Root Cause Analysis

I traced through the bedrock-agentcore SDK and found two distinct evaluation paths that produce different sessionSpans payloads:

Path 1: Python SDK (StrandsEvalsAgentCoreEvaluator) — works

When using convert_strands_to_adot() on raw in-memory OTel spans, it:

  • Calls ADOTDocumentBuilder.build_span_document() — which omits the events array from the span document
  • Generates a separate conversation log record document with body.input.messages / body.output.messages
  • Sends both to the evaluate API in sessionSpans

The backend reads the log record and evaluates successfully.

Path 2: agentcore run eval CLI — broken

The CLI uses CloudWatchAgentSpanCollector._fetch_spans() which queries aws/spans (and the runtime log group) and sends the raw CloudWatch spans directly to the evaluate API without any transformation.

These CloudWatch spans have the events array embedded but no separate conversation log records. The backend tries to parse the span events and fails.

Relevant Code

CloudWatchAgentSpanCollector._fetch_spans() (agent_span_collector.py):

# Fetches raw ADOT spans from CloudWatch including the events array
aws_spans = self._helper.query_log_group(AWS_SPANS_LOG_GROUP, ...)
event_spans = self._helper.query_log_group(self.log_group_name, ...)
all_data = aws_spans + event_spans
return all_data  # Sent as-is to evaluate API — no conversation log records generated

ADOTDocumentBuilder.build_span_document() (adot_models.py):

return {
    "resource": ..., "scope": ..., "traceId": ..., "spanId": ...,
    "attributes": attributes,
    "status": ...,
    # NOTE: no "events" key — events are intentionally dropped here
}

Failing Span (from aws/spans)

The span that the backend fails to parse is the invoke_agent Strands Agents span. Its events look like:

{
  "traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
  "spanId": "040303ac6912ea97",
  "name": "invoke_agent Strands Agents",
  "events": [
    {
      "name": "gen_ai.user.message",
      "attributes": {
        "content": "[{\"text\": \"Is this email phishing? Subject: ...\"}]"
      }
    },
    {
      "name": "gen_ai.choice",
      "attributes": {
        "message": "{\"label\":\"phishing\", ...}\n",
        "finish_reason": "end_turn"
      }
    }
  ]
}

The content field is a JSON-encoded Anthropic message content array ([{"text": "..."}]), not a plain string. This is what strands-agents emits via serialize(message["content"]).

Proposed Fix

The CLI's evaluation path should apply the same span transformation that the Python SDK does before sending to the evaluate API:

  1. For each span with events, run it through StrandsToADOTConverter.convert_span() (or equivalent)
  2. This drops the events from the span document and generates a separate conversation log record with body.input.messages / body.output.messages
  3. Include both the transformed span documents and the conversation log records in sessionSpans

Alternatively, the backend evaluator could be updated to correctly parse the Strands event format (Anthropic content arrays) directly from span events.

Workaround

Use the Python SDK evaluation path directly instead of the CLI:

from bedrock_agentcore.evaluation.integrations.strands_agents_evals.evaluator import StrandsEvalsAgentCoreEvaluator
# ... invoke agent with in-memory OTel exporter, convert spans, evaluate

This bypasses CloudWatch entirely and works correctly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions