feat: add agents/ orchestration framework for autonomous bug fixing and feature building #1124

Open
aidandaly24 wants to merge 12 commits into main from feat/agents-orchestration

Conversation

@aidandaly24
Contributor

Description

Adds a self-contained Python project (agents/) for autonomous agents powered by the Bedrock AgentCore Harness. This provides the shared orchestration infrastructure that the team's Frontier Week agents will build on.

What's included:

  • agents/core/ — shared harness client (raw HTTP + SigV4), response parsing, config
  • agents/orchestrations/fix_and_review/ — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
  • agents/bug_fixer/ — workflow: issue labeled bug → agent plans fix → implements → reviews → PRs
  • agents/feature_builder/ — workflow: devex doc + impl plan → agent builds feature → reviews → PRs
  • agents/pr_reviewer/ — migrated from .github/harness/ to share core infrastructure
  • GitHub Actions workflows for both triggers
  • 19 unit tests

Tested end-to-end: Successfully planned, implemented, and reviewed fixes for issues #761 and #924 with Opus 4.7, creating PRs with proper templates through 3 rounds of multi-agent review.

Architecture: See the proposal doc (linked in Quip) for full details on the layered design: workflows → orchestrations → phases → core.

Related Issue

Part of Frontier Week: CLI/SDK autonomous agents initiative.

Type of Change

  • New feature

Testing

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint

The agents/ directory has its own test suite: cd agents && uv sync && uv run pytest tests/ -v (19 tests passing).

Checklist

  • I have added any necessary tests that prove my fix is effective or my feature works
  • My changes generate no new warnings

Adds a self-contained Python project for autonomous agents powered by
Bedrock AgentCore Harness. Includes:

- core/ — shared harness client (raw HTTP + SigV4), response parsing, config
- orchestrations/fix_and_review/ — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
- bug_fixer/ — workflow entry point for fixing issues labeled 'bug'
- feature_builder/ — workflow entry point for building features from devex + impl docs
- pr_reviewer/ — migrated from .github/harness/ to share core infrastructure
- GitHub Actions workflows for both triggers
- 19 unit tests

Tested end-to-end: successfully planned, implemented, and reviewed fixes
for issues #761 and #924 with Opus 4.7, creating PRs with proper templates.
@aidandaly24 aidandaly24 requested a review from a team May 5, 2026 19:47
@github-actions github-actions Bot added size/xl PR size: XL and removed size/xl PR size: XL labels May 5, 2026
Comment thread agents/orchestrations/fix_and_review/orchestrator.py Fixed

from core.config import PipelineConfig
from core.harness_client import HarnessClient
from core.parsing import Finding
from orchestrations.fix_and_review.phases.aggregate import run_aggregate
from orchestrations.fix_and_review.phases.complete import run_complete
from orchestrations.fix_and_review.phases.execute import run_execute
from orchestrations.fix_and_review.phases.extract import ExtractResult, run_extract
import os
import tempfile

import pytest
@@ -0,0 +1,81 @@
import pytest

from core.parsing import Finding, ReviewResult, parse_reviewer_output
@@ -0,0 +1,74 @@
import pytest
Comment on lines +3 to +10
from orchestrations.fix_and_review.partitioning import (
DiffStats,
ReviewerAssignment,
calculate_reviewer_count,
partition_round1_by_directory,
partition_round2_focus_prompts,
partition_round3_risk_areas,
)
@github-actions github-actions Bot added the agentcore-harness-reviewing AgentCore Harness review in progress label May 5, 2026
github-actions Bot commented May 5, 2026

Coverage Report

Status  Category    Percentage  Covered / Total
🔵      Lines       43.1%       9015 / 20912
🔵      Statements  42.39%      9573 / 22582
🔵      Functions   39.91%      1553 / 3891
🔵      Branches    39.98%      5808 / 14527
Generated in workflow #2560 for commit 7026fd9 by the Vitest Coverage Report Action

Comment thread agents/config.yaml
Workflow sets HARNESS_ARN secret but code never reads it; hardcoded personal ARN will be used in CI instead

Both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml export HARNESS_ARN: ${{ secrets.HARNESS_ARN }} as an env var, but nothing in agents/ reads environment variables — PipelineConfig only loads from config.yaml, and the workflows don't pass --harness-arn.

The hardcoded value in agents/config.yaml is arn:aws:bedrock-agentcore:us-west-2:603141041947:harness/IssueSolver_aidandal-8SL97TEXjS — a personal developer harness (and a personal AWS account ID checked into the repo). When these workflows run in CI they will always hit that personal harness and ignore the secret.

Fix options:

  1. Have PipelineConfig.from_yaml (or PipelineConfig itself) read HARNESS_ARN/AWS_PROFILE/etc. from env vars, with env taking precedence over YAML.
  2. Pass --harness-arn "$HARNESS_ARN" on the uv run python -m ... line in each workflow.
  3. Remove the account-specific ARN from the committed config.yaml (leave it as a placeholder/empty) and require env/flag override in non-local runs.

Option 1 is probably cleanest since it also solves the aws_profile issue below.

self.session = boto3.Session(
    region_name=config.region,
    profile_name=config.aws_profile,
)

aws_profile="deploy" default will crash under GitHub Actions OIDC

Both workflows use aws-actions/configure-aws-credentials with role-to-assume — this sets AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN env vars, not a named profile. But PipelineConfig defaults aws_profile to "deploy", and HarnessClient.__init__ passes it unconditionally:

self.session = boto3.Session(region_name=config.region, profile_name=config.aws_profile)

In CI there is no deploy profile, so boto3 will raise ProfileNotFound the moment either agent starts.

Fix options:

  1. Treat aws_profile as optional (e.g. None default) and only pass profile_name= when it's set, so env-var credentials flow through naturally.
  2. Read AWS_PROFILE from env and default to None rather than "deploy".
  3. Pass an explicit --aws-profile override from the workflow (but there's no profile to point at in GH Actions, so this doesn't really work).

Option 1 or 2 is needed for the workflows to run at all.

Comment thread agents/pyproject.toml Outdated
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"

Python version mismatch: requires-python = ">=3.14" vs workflow using Python 3.12

agents/pyproject.toml declares requires-python = ">=3.14", but both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml use actions/setup-python@v6 with python-version: '3.12'.

uv sync will refuse to use 3.12 for a project that requires 3.14. uv may auto-download a 3.14 interpreter, but that is fragile, and Python 3.14 was only released in October 2025; this constraint is almost certainly unintended.

Fix options:

  1. Drop the requires-python constraint to >=3.12 (matches the workflow and the rest of your dependencies' support).
  2. Bump the workflow to python-version: '3.14'.

Option 1 is safer unless you actually depend on a 3.14-only feature.

3. Run tests with summary: `npm run test:unit 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20`
4. If tests fail, debug the specific file: `npm run test:unit -- path/to/failing.test.ts 2>&1 | tail -50`
5. Commit your changes: `git add -A && git commit -m "feat: {commit_message}"`
6. Push to remote: `git push origin feature/{feature_name}`

{feature_name} placeholder will cause a KeyError — feature_builder Phase 2 is broken

This template references {feature_name} on line 12, but run_execute in orchestrations/fix_and_review/phases/execute.py only passes plan, commit_message, and branch_name to load_prompt("executor.md", ...). str.format will raise KeyError: 'feature_name' the first time the feature_builder pipeline hits Phase 2: Execute.

The PR description says this was tested end-to-end on issues #761 and #924, but those are bug-fix cases — the feature_builder path hasn't actually exercised this prompt.

Fix options:

  1. Change line 12 to use {branch_name} instead of feature/{feature_name} (matches bug_fixer/prompts/executor.md and is what run_execute already supplies).
  2. Pass feature_name=feature_name or "" through from the orchestrator into run_execute and into load_prompt.

Option 1 is the minimal fix.
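Option 2 generalizes to any prompt template if load_prompt formats with a defaulting mapping instead of plain str.format. A sketch (SafeDict and render_prompt are hypothetical names, not the PR's code):

```python
class SafeDict(dict):
    """Mapping that yields an empty string for any placeholder not supplied."""

    def __missing__(self, key: str) -> str:
        return ""


def render_prompt(template: str, **values: str) -> str:
    # Unlike str.format, format_map with SafeDict never raises KeyError
    # when the template references a variable the caller didn't pass.
    return template.format_map(SafeDict(values))
```

With this, the feature_builder executor template would render even though run_execute only supplies plan, commit_message, and branch_name — though silently blank placeholders can mask template bugs, so option 1 is still the safer minimal fix.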

if exit_code == 0 and stdout.strip():
    pr_urls.append(stdout.strip())
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")

Duplicate PR-URL append + stale stderr/exit_code/stdout in the else branch

Lines 96–109 have two separate bugs from what looks like a bad merge / dead code that wasn't deleted:

if url_match:
    pr_urls.append(url_match.group(0))
else:
    stdout, _, _ = client.run_command(
        session_id, f"cd {repo_name} && gh pr list --head {branch_name} ..."
    )
    if stdout.strip():
        pr_urls.append(stdout.strip())
    else:
        errors.append(f"PR may have been created in {repo} but could not extract URL")
if exit_code == 0 and stdout.strip():       # <-- stale vars from previous iterations
    pr_urls.append(stdout.strip())           # <-- double-append on success path
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")  # <-- stale stderr from push
  1. On the success path (url_match was found), stdout/exit_code still hold values from the previous loop iteration's git push at line 64. If that happens to be exit_code == 0 and non-empty stdout, you append a bogus URL; otherwise you log "Failed to create PR" with stderr from a git push, even though the PR actually succeeded.
  2. On the fallback path (gh pr list), you also re-append stdout a second time via the trailing block — double URLs in the result.

Fix: delete lines 106–109 entirely. The if url_match / else block above already handles both branches.

continue
test_cmd = TEST_COMMANDS.get(repo, "npm test")
print(f" Running tests in {repo} (may take a few minutes)...", flush=True)
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20')

Typecheck/test exit_code is from tail, not from the command — verification will almost always falsely pass

Both the typecheck and test commands pipe through tail (and in the test case, also grep):

_, stderr, exit_code = client.run_command(session_id, f"cd {repo} && npm run typecheck 2>&1 | tail -5")
...
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|...)" | tail -20')

In a POSIX shell without set -o pipefail, the pipeline's exit status is the exit status of the last command (tail), which is essentially always 0. That means typecheck_passes and tests_pass will be set to True even when npm run typecheck / npm run test:unit fail — defeating the whole point of Phase 2.5.

Also note stderr here is always empty because of 2>&1, so the error message in errors.append(f"...: {stderr[:500]}") is useless even when the check does catch a failure.

Fix options:

  1. Prefix each command with set -o pipefail; (bash only — not guaranteed in plain sh).
  2. Run the command twice: once to capture real exit code (npm run typecheck > /tmp/tc.log 2>&1), then tail -5 /tmp/tc.log to get the tail for display.
  3. Write to a file and check $? explicitly: npm run typecheck > /tmp/tc.log 2>&1; rc=$?; tail -5 /tmp/tc.log; exit $rc.

Option 3 is the most shell-portable.
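Option 3 can be wrapped once and reused for both the typecheck and test commands. A sketch (the helper name is hypothetical): the pipeline's exit status becomes the real command's, regardless of what tail returns.

```python
def wrap_with_real_exit_code(cmd: str, log_path: str, tail_lines: int = 5) -> str:
    """Build a POSIX-sh command that shows the log tail but exits with cmd's status."""
    return (
        f"{cmd} > {log_path} 2>&1; rc=$?; "
        f"tail -{tail_lines} {log_path}; exit $rc"
    )
```

run_command would be invoked with wrap_with_real_exit_code(f"cd {repo} && npm run typecheck", "/tmp/tc.log"), and since output now goes to the log file rather than stderr, the error message should quote the captured tail instead of the always-empty stderr.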

if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower():
    affected_repos.append("agentcore-l3-cdk-constructs")
if not affected_repos:
    affected_repos = ["agentcore-cli"]

affected_repos detection is effectively always both repos

if "agentcore-cli" in plan.lower() or "cli" in plan.lower():
    affected_repos.append("agentcore-cli")
if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower():
    affected_repos.append("agentcore-l3-cdk-constructs")

The substring checks "cli" in plan.lower() and "cdk" in plan.lower() will match virtually any plan the LLM produces — plans routinely mention "CLI", "CDK", "agentcore-cli", etc. even when only one repo is actually touched. So affected_repos is almost always ["agentcore-cli", "agentcore-l3-cdk-constructs"], which then causes run_verify / run_complete / run_extract to cd into both repos, run tests in both, try to push branches in both, etc.

Downstream phases partially compensate by checking git diff main --stat before running tests/push (in verify.py), but run_extract does not — it runs git diff main in whatever the current directory is and will miss changes in the other repo.

Fix options:

  1. Stop guessing from the plan text. Instead, detect affected repos directly by asking the harness cd <repo> && git log main..HEAD --oneline in each known repo and keeping those with commits.
  2. Ask the planner to emit a structured affected_repos: [cli, cdk] list (e.g. a fenced JSON block) and parse it, rather than substring-matching prose.
  3. Keep a fixed list of both repos and let the per-repo git diff guards in verify.py filter — but then also fix run_extract to aggregate diffs from both repos.

Option 1 is the most robust since it measures ground truth.
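Option 2 might look like this sketch: the planner is asked to end its output with a fenced json block, and the orchestrator parses that instead of substring-matching prose (the function name and block format are assumptions, not the PR's code):

```python
import json
import re

KNOWN_REPOS = {"agentcore-cli", "agentcore-l3-cdk-constructs"}


def parse_affected_repos(plan: str) -> list[str]:
    """Extract an affected_repos list from a fenced json block in the plan text."""
    m = re.search(r"```json\s*(\{.*?\})\s*```", plan, re.DOTALL)
    if not m:
        # Conservative default, mirroring the current fallback behavior.
        return ["agentcore-cli"]
    repos = json.loads(m.group(1)).get("affected_repos", [])
    # Ignore anything outside the known repo set; never return an empty list.
    return [r for r in repos if r in KNOWN_REPOS] or ["agentcore-cli"]
```

Crucially, a plan that merely mentions "CLI" or "CDK" in prose no longer drags the other repo into verify/complete/extract.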

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 5, 2026
Critical:
- Remove stale variables in complete.py causing duplicate PR URLs

High:
- Add input validation in feature-builder.yml (path traversal, command injection)
- Resolve AWS credentials per-request instead of freezing at construction
- Use format_map with defaults to prevent KeyError on missing template vars
- Capture test exit code separately from grep display in verify.py
- Make JSON brace-depth counter string-aware in parsing.py
- Gitignore config.yaml (contains account-specific ARN), add config.yaml.example
- Guard against empty changed_files in partition_round1_by_directory

Medium:
- Add type coercion for numeric overrides in orchestrator
- Only push after all local checks pass in verify.py
- Skip push when rebase fails in complete.py
- Lower Python requirement to >=3.12
- Widen boto3/botocore version constraints