[examples] feat: blackbox mini-swe-agent training recipe #73

Open
zhaizhiqiangA wants to merge 3 commits into verl-project:main from zhaizhiqiangA:blackbox-recipe

@zhaizhiqiangA

Collaborator

What does this PR do?

This PR adds a blackbox RL training recipe for mini-swe-agent under examples/blackbox_recipes/. The agent runs entirely inside a remote sandbox via a sidecar tool-image mount: the host-side runner creates the sandbox, pipes the task config to the in-sandbox agent over stdin, parses the result from stdout, and evaluates the reward in the same sandbox. The agent reaches the LLM through the gateway via an upstream tunnel, so training is fully "blackbox": the trainer only sees prompts and responses through the gateway. Training uses the V1 unified trainer (Megatron backend, GRPO, separate_async).
Related work:

[train] feat: add blackbox agent gateway (#25) — the gateway this recipe runs against

Checklist Before Starting

  • Search for similar PRs or issues and paste at least one relevant link here:
    gh pr list --repo verl-project/uni-agent --state open --search "mini-swe-agent"
    No pull requests match your search in verl-project/uni-agent
  • Format the PR title as [examples] feat: blackbox mini-swe-agent training recipe

Test

A full RL training recipe is not practical to cover in CI, so validation was manual:

  • Inference smoke test: ran a single sample end-to-end against a remote sandbox.
  • Short training run: bash examples/blackbox_recipes/scripts/run_train.sh with the V1 separate_async trainer on a single 8-GPU node (4 trainer + 4 rollout).

API and Usage Example

This PR only adds files under examples/ plus minor internal import-path updates; there are no public API changes.

Design & Code Changes

New recipe — examples/blackbox_recipes/mini_swe_agent/

mini_swe_agent_runner.py — host-side runner. Creates a YRSandbox with the sidecar mounted at /opt/mini-swe-agent, base64-encodes the task config (task text + tunnel-rewritten gateway URL + step limit) and pipes it to run_agent.py via stdin, parses the JSON result from stdout (robust to litellm noise), then evaluates the reward in the same sandbox via SandboxEnvForReward and POSTs reward_info. Sandbox is always cleaned up in finally.
run_agent.py — in-sandbox entrypoint. Builds a LocalEnvironment + LitellmModel (pointed at the gateway tunnel) + DefaultAgent from mini-swe-agent's SWE-bench defaults, runs the task, emits a result JSON.
Dockerfile.mini-swe-agent-tool — self-contained, glibc-portable sidecar image (FROM scratch) so the sandbox base image needs no Python/Node.
dataset.py (SWEBenchDataset) injects verl-standard reward fields; reward.py reuses the uni_agent reward-spec registry to score resolved/unresolved in-env.
config/swe_agent_blackbox_megatron_v1.yaml + scripts/run_train.sh — V1 unified trainer, separate_async by default (4 GPU trainer + 4 GPU rollout on one node), vLLM async rollout, GRPO, Megatron offload.
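
The stdin/stdout handoff between the runner and the in-sandbox entrypoint can be sketched roughly as follows. This is an illustration under stated assumptions, not the recipe's actual code: the helper names (encode_task_config, decode_task_config, parse_agent_result), the config keys, and the "last line that parses as a JSON object wins" rule are all hypothetical.

```python
import base64
import json


def encode_task_config(task: str, gateway_url: str, step_limit: int) -> str:
    """Serialize the task config to base64 so it survives a shell pipe intact."""
    # The key names here are assumptions made for this sketch.
    payload = {"task": task, "gateway_url": gateway_url, "step_limit": step_limit}
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


def decode_task_config(encoded: str) -> dict:
    """Inverse of encode_task_config, as the in-sandbox side might apply it."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


def parse_agent_result(stdout: str) -> dict:
    """Return the last stdout line that parses as a JSON object, skipping log noise."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            result = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(result, dict):
            return result
    raise ValueError("no JSON result found in agent stdout")
```

Scanning from the end means warnings printed before or after the result line do not break parsing, as long as the result itself is emitted on a single line.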

Checklist Before Submitting

  • Read the Contribute Guide
  • Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add or update docs/examples for user-facing changes
  • Add tests or explain why tests are not practical
  • Confirm the PR title matches the required format
  • Confirm the placeholder text in this template has been replaced with real content

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a blackbox recipe for running mini_swe_agent inside an OpenYuanRong remote sandbox, including Dockerfiles, configurations, dataset and reward utilities, and runner scripts. Feedback focuses on improving robustness and preventing resource leaks, such as reordering task configuration before sandbox creation to avoid leaking sandboxes on error, shell-quoting file paths, wrapping blocking network calls in asyncio.to_thread, handling missing ports and preserving query parameters in gateway URLs, adding defensive type checks to prevent AttributeError or TypeError on missing/invalid dictionaries, and using ${BASH_SOURCE[0]} for robust script directory resolution.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +144 to +156
sandbox = await YRSandbox.create(
    image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
)
sandbox_id = sandbox.sandbox_id
logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)

# Build task config (gateway URL rewritten to sandbox-internal tunnel)
task_config = _build_task_config(
    task=task,
    gateway_url=gateway_url,
)

try:

high

If _build_task_config raises an exception (e.g., due to invalid environment variables or URL parsing issues), the remote sandbox created by YRSandbox.create will be leaked because the exception is raised before entering the try...finally block. To prevent resource leaks, execute _build_task_config before creating the sandbox.

Suggested change
- sandbox = await YRSandbox.create(
-     image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
- )
- sandbox_id = sandbox.sandbox_id
- logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)
- # Build task config (gateway URL rewritten to sandbox-internal tunnel)
- task_config = _build_task_config(
-     task=task,
-     gateway_url=gateway_url,
- )
- try:
+ # Build task config (gateway URL rewritten to sandbox-internal tunnel)
+ task_config = _build_task_config(
+     task=task,
+     gateway_url=gateway_url,
+ )
+ sandbox = await YRSandbox.create(
+     image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
+ )
+ sandbox_id = sandbox.sandbox_id
+ logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)
+ try:
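
The ordering concern can be illustrated with a small self-contained sketch. FakeSandbox, build_task_config, and run_episode are hypothetical stand-ins invented for this example; they are not the recipe's YRSandbox or _build_task_config. The point is only that fallible local work runs before the remote resource is created, and that everything after creation sits inside try/finally.

```python
import asyncio


class FakeSandbox:
    """Hypothetical stand-in for a remote sandbox; counts live instances."""

    live = 0

    @classmethod
    async def create(cls) -> "FakeSandbox":
        cls.live += 1
        return cls()

    async def close(self) -> None:
        type(self).live -= 1


def build_task_config(gateway_url: str) -> dict:
    """Cheap local validation that may raise before any remote resource exists."""
    if not gateway_url.startswith("http"):
        raise ValueError(f"invalid gateway URL: {gateway_url!r}")
    return {"gateway_url": gateway_url}


async def run_episode(gateway_url: str) -> dict:
    # Fallible local work happens first, so a failure here cannot leak a sandbox.
    task_config = build_task_config(gateway_url)
    sandbox = await FakeSandbox.create()
    try:
        return task_config  # placeholder for running the agent in the sandbox
    finally:
        # Runs on success and on any exception raised inside the try block.
        await sandbox.close()
```

With this ordering, a bad gateway URL raises before FakeSandbox.create is ever awaited, so the live count stays at zero.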

Comment on lines +45 to +50
async def write_file(self, path: str | Path, content: str) -> None:
    encoded = base64.b64encode(content.encode()).decode()
    await self.communicate(f"echo {encoded} | base64 -d > {path}", check="raise", error_msg=f"write {path}")

async def read_file(self, path: str | Path, **_) -> str:
    return await self.communicate(f"cat {path}")

medium

If the file path contains spaces or special shell characters, the commands executed via communicate will fail or behave unexpectedly because the path is not shell-quoted. Use shlex.quote to safely escape the path.

Suggested change
- async def write_file(self, path: str | Path, content: str) -> None:
-     encoded = base64.b64encode(content.encode()).decode()
-     await self.communicate(f"echo {encoded} | base64 -d > {path}", check="raise", error_msg=f"write {path}")
- async def read_file(self, path: str | Path, **_) -> str:
-     return await self.communicate(f"cat {path}")
+ async def write_file(self, path: str | Path, content: str) -> None:
+     encoded = base64.b64encode(content.encode()).decode()
+     await self.communicate(f"echo {encoded} | base64 -d > {shlex.quote(str(path))}", check="raise", error_msg=f"write {path}")
+ async def read_file(self, path: str | Path, **_) -> str:
+     return await self.communicate(f"cat {shlex.quote(str(path))}")
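
To see what the quoting changes in practice, the command strings can be built in isolation. The helper names build_write_command and build_read_command are hypothetical and exist only for this sketch; the real methods pass the strings to communicate.

```python
import base64
import shlex
from pathlib import Path
from typing import Union


def build_write_command(path: Union[str, Path], content: str) -> str:
    """Build a shell command that writes `content` to `path` via base64."""
    encoded = base64.b64encode(content.encode()).decode()
    # Base64 output is limited to [A-Za-z0-9+/=], so only the path needs quoting.
    return f"echo {encoded} | base64 -d > {shlex.quote(str(path))}"


def build_read_command(path: Union[str, Path]) -> str:
    """Build a shell command that prints the file at `path`."""
    return f"cat {shlex.quote(str(path))}"


if __name__ == "__main__":
    print(build_read_command("/tmp/my file.txt"))
    print(build_write_command(Path("/tmp/a; rm -rf x"), "hi"))
```

Paths made up only of shell-safe characters pass through shlex.quote unchanged, so the common case produces the same command as before.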

Comment on lines +193 to +197
if self._sandbox.is_running():
    await asyncio.to_thread(self._sandbox.kill)
    logger.info("YR sandbox %s killed", sandbox_id)
else:
    logger.info("YR sandbox %s already stopped", sandbox_id)

medium

self._sandbox.is_running() is a synchronous blocking network call to the remote sandbox SDK. Calling it directly in an async def function blocks the event loop. Wrap it in asyncio.to_thread to prevent blocking the main thread.

Suggested change
- if self._sandbox.is_running():
-     await asyncio.to_thread(self._sandbox.kill)
-     logger.info("YR sandbox %s killed", sandbox_id)
- else:
-     logger.info("YR sandbox %s already stopped", sandbox_id)
+ is_running = await asyncio.to_thread(self._sandbox.is_running)
+ if is_running:
+     await asyncio.to_thread(self._sandbox.kill)
+     logger.info("YR sandbox %s killed", sandbox_id)
+ else:
+     logger.info("YR sandbox %s already stopped", sandbox_id)
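
A minimal sketch of the same pattern, using a fake synchronous client in place of the real sandbox SDK. BlockingSandbox and stop_sandbox are hypothetical names, and the time.sleep calls merely simulate network latency; the sketch records which thread each blocking call ran on.

```python
import asyncio
import threading
import time


class BlockingSandbox:
    """Hypothetical synchronous SDK client; each call simulates network latency."""

    def __init__(self) -> None:
        self.running = True
        self.call_threads = []  # names of the threads that executed SDK calls

    def is_running(self) -> bool:
        self.call_threads.append(threading.current_thread().name)
        time.sleep(0.05)
        return self.running

    def kill(self) -> None:
        self.call_threads.append(threading.current_thread().name)
        time.sleep(0.05)
        self.running = False


async def stop_sandbox(sandbox: BlockingSandbox) -> bool:
    """Return True if the sandbox was killed, False if it had already stopped."""
    # Both blocking calls run in worker threads, leaving the event loop free.
    if await asyncio.to_thread(sandbox.is_running):
        await asyncio.to_thread(sandbox.kill)
        return True
    return False
```

Because asyncio.to_thread takes the callable itself, the method is passed without parentheses; writing sandbox.is_running() there would run the blocking call on the event loop before to_thread is even entered.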

Comment on lines +55 to +61
def extract_upstream(gateway_url: str) -> str:
    """Extract host:port from a gateway URL for upstream tunnel config.

    Example: "http://8.92.9.155:40169/sessions/abc/v1" -> "8.92.9.155:40169"
    """
    parsed = urlparse(gateway_url)
    return f"{parsed.hostname}:{parsed.port}"

medium

If the gateway_url does not specify an explicit port (e.g., standard http or https URLs), parsed.port will be None, resulting in an invalid upstream string like host:None. Handle missing ports by defaulting to 80 for http and 443 for https.

def extract_upstream(gateway_url: str) -> str:
    """Extract host:port from a gateway URL for upstream tunnel config.

    Example: "http://8.92.9.155:40169/sessions/abc/v1" -> "8.92.9.155:40169"
    """
    parsed = urlparse(gateway_url)
    hostname = parsed.hostname or ""
    port = parsed.port
    if port is None:
        port = 443 if parsed.scheme == "https" else 80
    return f"{hostname}:{port}"

Comment on lines +78 to +80
parsed = urlparse(gateway_url)
path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
return f"http://127.0.0.1:{proxy_port}{path}"

medium

Rebuilding the gateway URL using only parsed.path discards any query parameters or fragments present in the original URL. Preserve them by appending parsed.query and parsed.fragment to the rewritten URL.

Suggested change
- parsed = urlparse(gateway_url)
- path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
- return f"http://127.0.0.1:{proxy_port}{path}"
+ parsed = urlparse(gateway_url)
+ path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
+ query = f"?{parsed.query}" if parsed.query else ""
+ fragment = f"#{parsed.fragment}" if parsed.fragment else ""
+ return f"http://127.0.0.1:{proxy_port}{path}{query}{fragment}"
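
An alternative to concatenating the pieces by hand is to let urllib.parse reassemble the URL. The sketch below is one possible shape, not the PR's code: the function name rewrite_gateway_url and its signature are assumptions, since the review only shows three lines of the original function.

```python
from urllib.parse import urlparse


def rewrite_gateway_url(gateway_url: str, proxy_port: int, strip_v1: bool = False) -> str:
    """Point a gateway URL at a local tunnel port, keeping path, query, and fragment."""
    parsed = urlparse(gateway_url)
    path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
    # _replace swaps only the named components; query and fragment carry over as-is.
    rewritten = parsed._replace(scheme="http", netloc=f"127.0.0.1:{proxy_port}", path=path)
    return rewritten.geturl()


if __name__ == "__main__":
    print(rewrite_gateway_url("http://8.92.9.155:40169/sessions/abc/v1", 9000))
```

Going through geturl() avoids having to remember the "?" and "#" separators, at the cost of relying on the named-tuple _replace method of the parse result.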

Comment on lines +12 to +15
image = env_config.get("image")
if image:
    return image
deployment = env_config.get("deployment")

medium

If env_config is None or not a dictionary, calling env_config.get will raise an AttributeError. Add a type check to handle non-dictionary inputs gracefully.

    if not isinstance(env_config, dict):
        return ""
    image = env_config.get("image")
    if image:
        return image
    deployment = env_config.get("deployment")

Comment on lines +27 to +32
extra_info = row_dict.get("extra_info", {})
tools_kwargs = extra_info.get("tools_kwargs", {})
reward_config = tools_kwargs.get("reward", {})

row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
row_dict.setdefault("reward_model", {"ground_truth": {}})

medium

If extra_info or tools_kwargs is present but None or otherwise not a dictionary, calling .get on it will raise an AttributeError (the {} default only covers a missing key). Use defensive type checks to ensure robustness.

Suggested change
- extra_info = row_dict.get("extra_info", {})
- tools_kwargs = extra_info.get("tools_kwargs", {})
- reward_config = tools_kwargs.get("reward", {})
- row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
- row_dict.setdefault("reward_model", {"ground_truth": {}})
+ extra_info = row_dict.get("extra_info") or {}
+ tools_kwargs = extra_info.get("tools_kwargs") or {} if isinstance(extra_info, dict) else {}
+ reward_config = tools_kwargs.get("reward") or {} if isinstance(tools_kwargs, dict) else {}
+ row_dict.setdefault("data_source", reward_config.get("name", "unknown") if isinstance(reward_config, dict) else "unknown")
+ row_dict.setdefault("reward_model", {"ground_truth": {}})
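
The chained conditionals in the suggestion could also be folded into a small coercion helper. This is a sketch of that alternative, not code from the PR: as_dict and fill_reward_fields are hypothetical names, and the field names are taken from the snippet under review.

```python
from typing import Any


def as_dict(value: Any) -> dict:
    """Return `value` if it is a dict, otherwise an empty dict."""
    return value if isinstance(value, dict) else {}


def fill_reward_fields(row_dict: dict) -> dict:
    """Inject default reward fields, tolerating missing or malformed nesting."""
    extra_info = as_dict(row_dict.get("extra_info"))
    tools_kwargs = as_dict(extra_info.get("tools_kwargs"))
    reward_config = as_dict(tools_kwargs.get("reward"))

    # setdefault leaves any value that the row already carries untouched.
    row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
    row_dict.setdefault("reward_model", {"ground_truth": {}})
    return row_dict
```

Each level is normalized once, so the later lines can call .get unconditionally and the fallback to "unknown" happens in a single place.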

Comment on lines +32 to +33
if extra_info and "reward_score" in extra_info:
    score = float(extra_info["reward_score"])

medium

If extra_info is a truthy value that is not a dictionary (e.g., a non-zero number), checking "reward_score" in extra_info raises a TypeError; None is already short-circuited by the truthiness check. Ensure extra_info is a dictionary before performing the membership check.

Suggested change
- if extra_info and "reward_score" in extra_info:
-     score = float(extra_info["reward_score"])
+ if isinstance(extra_info, dict) and "reward_score" in extra_info:
+     score = float(extra_info["reward_score"])

#
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

medium

Using dirname "$0" can be unreliable if the script is sourced or executed via certain shell interpreters. Using dirname "${BASH_SOURCE[0]}" is more robust and consistent with the other scripts in this repository (e.g., run_train.sh).

Suggested change
- SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
