bug(eval): At least one --evaluator or --evaluator-arn is required
[?25h returns "Failed to parse event body" for Strands agents — CLI sends raw CloudWatch spans without conversation log records

## Summary

`agentcore run eval` consistently fails with `"Failed to parse event body for span with ID: <span_id>"` when evaluating Strands-based agents. After tracing through the bedrock-agentcore SDK source code, I believe this is a mismatch between what the CLI sends to the evaluate API and what the backend evaluator expects.

## Environment

- `agentcore` CLI version: latest (Node.js, `@aws/agentcore`)
- `bedrock-agentcore` SDK version: `1.7.0`
- `strands-agents` version: `>=0.1.0` (1.37.0 installed)
- Agent framework: **Strands** (Python)
- AWS region: `us-east-1`

## Steps to Reproduce

1. Deploy a Strands-based agent to AgentCore Runtime
2. Invoke the agent once to produce a session with spans in CloudWatch (`aws/spans`)
3. Run `agentcore run eval --evaluator "Builtin.Correctness" --trace-id <trace_id>`

## Observed Behavior

```json
{
  "evaluator": "Builtin.Correctness",
  "aggregateScore": 0,
  "sessionScores": [
    {
      "sessionId": "a3498176-816d-462b-a038-8ad4e7236a50",
      "traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
      "value": 0,
      "errorMessage": "Failed to parse event body for span with ID: 040303ac6912ea97"
    }
  ]
}
```

## Root Cause Analysis

I traced through the bedrock-agentcore SDK and found **two distinct evaluation paths** that produce different `sessionSpans` payloads:

### Path 1: Python SDK (`StrandsEvalsAgentCoreEvaluator`) — works

When using `convert_strands_to_adot()` on raw in-memory OTel spans, it:
- Calls `ADOTDocumentBuilder.build_span_document()` — which **omits the `events` array** from the span document
- Generates a **separate conversation log record** document with `body.input.messages` / `body.output.messages`
- Sends **both** to the evaluate API in `sessionSpans`

The backend reads the log record and evaluates successfully.

### Path 2: `agentcore run eval` CLI — broken

The CLI uses `CloudWatchAgentSpanCollector._fetch_spans()` which queries `aws/spans` (and the runtime log group) and sends the raw CloudWatch spans **directly** to the evaluate API without any transformation.

These CloudWatch spans **have the `events` array embedded** but **no separate conversation log records**. The backend tries to parse the span events and fails.

## Relevant Code

**`CloudWatchAgentSpanCollector._fetch_spans()`** (`agent_span_collector.py`):
```python
# Fetches raw ADOT spans from CloudWatch including the events array
aws_spans = self._helper.query_log_group(AWS_SPANS_LOG_GROUP, ...)
event_spans = self._helper.query_log_group(self.log_group_name, ...)
all_data = aws_spans + event_spans
return all_data  # Sent as-is to evaluate API — no conversation log records generated
```

**`ADOTDocumentBuilder.build_span_document()`** (`adot_models.py`):
```python
return {
    "resource": ..., "scope": ..., "traceId": ..., "spanId": ...,
    "attributes": attributes,
    "status": ...,
    # NOTE: no "events" key — events are intentionally dropped here
}
```

## Failing Span (from `aws/spans`)

The span that the backend fails to parse is the `invoke_agent Strands Agents` span. Its events look like:

```json
{
  "traceId": "69f15c222374220b7e1f00bc2f1dcaa6",
  "spanId": "040303ac6912ea97",
  "name": "invoke_agent Strands Agents",
  "events": [
    {
      "name": "gen_ai.user.message",
      "attributes": {
        "content": "[{\"text\": \"Is this email phishing? Subject: ...\"}]"
      }
    },
    {
      "name": "gen_ai.choice",
      "attributes": {
        "message": "{\"label\":\"phishing\", ...}\n",
        "finish_reason": "end_turn"
      }
    }
  ]
}
```

The `content` field is a JSON-encoded Anthropic message content array (`[{"text": "..."}]`), not a plain string. This is what `strands-agents` emits via `serialize(message["content"])`.

## Proposed Fix

The CLI's evaluation path should apply the same span transformation that the Python SDK does before sending to the evaluate API:

1. For each span with events, run it through `StrandsToADOTConverter.convert_span()` (or equivalent)
2. This drops the `events` from the span document and generates a separate conversation log record with `body.input.messages` / `body.output.messages`
3. Include **both** the transformed span documents **and** the conversation log records in `sessionSpans`

Alternatively, the backend evaluator could be updated to correctly parse the Strands event format (Anthropic content arrays) directly from span events.

## Workaround

Use the Python SDK evaluation path directly instead of the CLI:

```python
from bedrock_agentcore.evaluation.integrations.strands_agents_evals.evaluator import StrandsEvalsAgentCoreEvaluator
# ... invoke agent with in-memory OTel exporter, convert spans, evaluate
```

This bypasses CloudWatch entirely and works correctly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(eval): At least one --evaluator or --evaluator-arn is required [?25h returns "Failed to parse event body" for Strands agents — CLI sends raw CloudWatch spans without conversation log records #1006

Summary

Environment

Steps to Reproduce

Observed Behavior

Root Cause Analysis

Path 1: Python SDK (`StrandsEvalsAgentCoreEvaluator`) — works

Path 2: `agentcore run eval` CLI — broken

Relevant Code

Failing Span (from `aws/spans`)

Proposed Fix

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

bug(eval): At least one --evaluator or --evaluator-arn is required [?25h returns "Failed to parse event body" for Strands agents — CLI sends raw CloudWatch spans without conversation log records #1006

Description

Summary

Environment

Steps to Reproduce

Observed Behavior

Root Cause Analysis

Path 1: Python SDK (StrandsEvalsAgentCoreEvaluator) — works

Path 2: agentcore run eval CLI — broken

Relevant Code

Failing Span (from aws/spans)

Proposed Fix

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Path 1: Python SDK (`StrandsEvalsAgentCoreEvaluator`) — works

Path 2: `agentcore run eval` CLI — broken

Failing Span (from `aws/spans`)