[examples] feat: blackbox mini-swe-agent training recipe by zhaizhiqiangA · Pull Request #73 · verl-project/uni-agent

zhaizhiqiangA · 2026-06-29T07:24:17Z

What does this PR do?

This PR adds a blackbox RL training recipe for mini-swe-agent under examples/blackbox_recipes/. The agent runs entirely inside a remote sandbox via a sidecar tool-image mount: the host-side runner creates the sandbox, pipes the task config to the in-sandbox agent over stdin, parses the result from stdout, and evaluates the reward in the same sandbox. The agent reaches the LLM through the gateway via an upstream tunnel, so training is fully "blackbox": the trainer only sees prompts and responses through the gateway. Training uses the V1 unified trainer (Megatron backend, GRPO, separate_async).
Related work:

[train] feat: add blackbox agent gateway (#25) — the gateway this recipe runs against

Checklist Before Starting

Search for similar PRs or issues and paste at least one relevant link here:
gh pr list --repo verl-project/uni-agent --state open --search "mini-swe-agent"
No pull requests match your search in verl-project/uni-agent
Format the PR title as [examples] feat: blackbox mini-swe-agent training recipe

Test

A full RL training recipe is not practical to cover in CI, so validation was manual:

Inference smoke test: ran a single sample end-to-end against a remote sandbox.
Short training run: ran bash examples/blackbox_recipes/scripts/run_train.sh with the V1 separate_async trainer on a single 8-GPU node (4 trainer + 4 rollout).

API and Usage Example

This PR only adds files under examples/ plus minor internal import-path updates; there are no public API changes.

Design & Code Changes

New recipe — examples/blackbox_recipes/mini_swe_agent/

mini_swe_agent_runner.py — host-side runner. Creates a YRSandbox with the sidecar mounted at /opt/mini-swe-agent, base64-encodes the task config (task text + tunnel-rewritten gateway URL + step limit) and pipes it to run_agent.py via stdin, parses the JSON result from stdout (robust to litellm noise), then evaluates the reward in the same sandbox via SandboxEnvForReward and POSTs reward_info. Sandbox is always cleaned up in finally.
run_agent.py — in-sandbox entrypoint. Builds a LocalEnvironment + LitellmModel (pointed at the gateway tunnel) + DefaultAgent from mini-swe-agent's SWE-bench defaults, runs the task, emits a result JSON.
Dockerfile.mini-swe-agent-tool — self-contained, glibc-portable sidecar image (FROM scratch) so the sandbox base image needs no Python/Node.
dataset.py (SWEBenchDataset) injects verl-standard reward fields; reward.py reuses the uni_agent reward-spec registry to score resolved/unresolved in-env.
config/swe_agent_blackbox_megatron_v1.yaml + scripts/run_train.sh — V1 unified trainer, separate_async by default (4 GPU trainer + 4 GPU rollout on one node), vLLM async rollout, GRPO, Megatron offload.
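
The stdin/stdout handoff between the host-side runner and the in-sandbox entrypoint can be sketched roughly as follows. This is a minimal illustration, not the recipe's code: the helper names (`encode_task_config`, `decode_task_config`, `parse_result`), the config field names, and the rule of taking the last stdout line that parses as a JSON object are all assumptions made for the example.

```python
import base64
import json


def encode_task_config(config: dict) -> str:
    """Serialize a task config to a single base64 line that is safe to pipe over stdin."""
    return base64.b64encode(json.dumps(config).encode("utf-8")).decode("ascii")


def decode_task_config(payload: str) -> dict:
    """Inverse of encode_task_config, as an in-sandbox entrypoint might do."""
    return json.loads(base64.b64decode(payload.strip()).decode("utf-8"))


def parse_result(stdout: str) -> dict:
    """Return the last stdout line that parses as a JSON object, skipping log noise."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON result found in agent stdout")


# Hypothetical field names and values, for illustration only.
config = {
    "task": "fix the failing test",
    "gateway_url": "http://127.0.0.1:8080/sessions/abc/v1",
    "step_limit": 50,
}
payload = encode_task_config(config)
noisy_stdout = 'LiteLLM: some warning\n{"exit_status": "submitted", "steps": 12}\ntrailing log line\n'
result = parse_result(noisy_stdout)
```

Base64 keeps the payload on one line regardless of what the task text contains, and scanning stdout from the end tolerates library log lines printed before or after the result.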

Checklist Before Submitting

Read the Contribute Guide
Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add or update docs/examples for user-facing changes
Add tests or explain why tests are not practical
Confirm the PR title matches the required format
Confirm the placeholder text in this template has been replaced with real content

gemini-code-assist

Code Review

This pull request introduces a blackbox recipe for running mini_swe_agent inside an OpenYuanRong remote sandbox, including Dockerfiles, configurations, dataset and reward utilities, and runner scripts. Feedback focuses on improving robustness and preventing resource leaks, such as reordering task configuration before sandbox creation to avoid leaking sandboxes on error, shell-quoting file paths, wrapping blocking network calls in asyncio.to_thread, handling missing ports and preserving query parameters in gateway URLs, adding defensive type checks to prevent AttributeError or TypeError on missing/invalid dictionaries, and using ${BASH_SOURCE[0]} for robust script directory resolution.

gemini-code-assist · 2026-06-29T07:26:48Z

+    sandbox = await YRSandbox.create(
+        image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
+    )
+    sandbox_id = sandbox.sandbox_id
+    logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)
+
+    # Build task config (gateway URL rewritten to sandbox-internal tunnel)
+    task_config = _build_task_config(
+        task=task,
+        gateway_url=gateway_url,
+    )
+
+    try:


If _build_task_config raises an exception (e.g., due to invalid environment variables or URL parsing issues), the remote sandbox created by YRSandbox.create will be leaked because the exception is raised before entering the try...finally block. To prevent resource leaks, execute _build_task_config before creating the sandbox.

Suggested change

-    sandbox = await YRSandbox.create(
-        image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
-    )
-    sandbox_id = sandbox.sandbox_id
-    logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)
-
-    # Build task config (gateway URL rewritten to sandbox-internal tunnel)
-    task_config = _build_task_config(
-        task=task,
-        gateway_url=gateway_url,
-    )
-
-    try:
+    # Build task config (gateway URL rewritten to sandbox-internal tunnel)
+    task_config = _build_task_config(
+        task=task,
+        gateway_url=gateway_url,
+    )
+
+    sandbox = await YRSandbox.create(
+        image=image, sidecar_image=tool_image, upstream=upstream, max_retries=int(sandbox_max_retries),
+    )
+    sandbox_id = sandbox.sandbox_id
+    logger.info("Sandbox created (image=%s, sandbox_id=%s)", image, sandbox_id)
+
+    try:
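
The ordering issue can be reproduced without any real sandbox. The sketch below is illustrative only: `FakeSandbox`, `build_task_config`, and `run_one` are hypothetical stand-ins, not the PR's classes. It shows that when the fallible local step runs before the resource is created, a validation failure leaves nothing to clean up, while the `finally` block still covers everything after creation.

```python
import asyncio


class FakeSandbox:
    """Stand-in for a remote sandbox; tracks how many instances are still open."""

    live = 0  # sandboxes created but not yet closed

    @classmethod
    async def create(cls) -> "FakeSandbox":
        cls.live += 1
        return cls()

    async def close(self) -> None:
        type(self).live -= 1


def build_task_config(gateway_url: str) -> dict:
    """Hypothetical validation step that can raise before any remote resource exists."""
    if not gateway_url.startswith("http"):
        raise ValueError(f"invalid gateway URL: {gateway_url!r}")
    return {"gateway_url": gateway_url}


async def run_one(gateway_url: str) -> dict:
    # Fallible, purely local work first: a failure here leaks nothing.
    task_config = build_task_config(gateway_url)
    sandbox = await FakeSandbox.create()
    try:
        return {"config": task_config, "status": "ok"}
    finally:
        # Runs on success and on any exception raised inside the try body.
        await sandbox.close()


async def main() -> tuple:
    ok = await run_one("http://10.0.0.1:8080/v1")
    try:
        await run_one("not-a-url")
        failed = False
    except ValueError:
        failed = True
    return ok, failed, FakeSandbox.live


ok, failed, live = asyncio.run(main())
```

An alternative with the same effect would be to keep the original order and move the config construction inside the `try` block.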

gemini-code-assist · 2026-06-29T07:26:48Z

+    async def write_file(self, path: str | Path, content: str) -> None:
+        encoded = base64.b64encode(content.encode()).decode()
+        await self.communicate(f"echo {encoded} | base64 -d > {path}", check="raise", error_msg=f"write {path}")
+
+    async def read_file(self, path: str | Path, **_) -> str:
+        return await self.communicate(f"cat {path}")


If the file path contains spaces or special shell characters, the commands executed via communicate will fail or behave unexpectedly because the path is not shell-quoted. Use shlex.quote to safely escape the path.

Suggested change

-    async def write_file(self, path: str | Path, content: str) -> None:
-        encoded = base64.b64encode(content.encode()).decode()
-        await self.communicate(f"echo {encoded} | base64 -d > {path}", check="raise", error_msg=f"write {path}")
-
-    async def read_file(self, path: str | Path, **_) -> str:
-        return await self.communicate(f"cat {path}")
+    async def write_file(self, path: str | Path, content: str) -> None:
+        encoded = base64.b64encode(content.encode()).decode()
+        await self.communicate(f"echo {encoded} | base64 -d > {shlex.quote(str(path))}", check="raise", error_msg=f"write {path}")
+
+    async def read_file(self, path: str | Path, **_) -> str:
+        return await self.communicate(f"cat {shlex.quote(str(path))}")
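
To see what quoting changes in practice, the sketch below builds the same two commands as plain strings. The helper names `build_write_command` and `build_read_command` are hypothetical; only `shlex.quote` and `base64.b64encode` are real standard-library calls.

```python
import base64
import shlex


def build_write_command(path: str, content: str) -> str:
    """Build a shell command that writes content to path via base64, quoting the path."""
    encoded = base64.b64encode(content.encode()).decode()
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains no shell metacharacters,
    # so only the path needs quoting here.
    return f"echo {encoded} | base64 -d > {shlex.quote(path)}"


def build_read_command(path: str) -> str:
    """Build a shell command that prints the file at path."""
    return f"cat {shlex.quote(path)}"


plain = build_read_command("/tmp/report.txt")
spaced = build_read_command("/tmp/my report.txt")
hostile = build_read_command("/tmp/x; rm -rf ~")
write_cmd = build_write_command("/tmp/a b", "hi")
```

Paths made only of safe characters pass through unchanged, while a path containing spaces or `;` is wrapped in single quotes and reaches the shell as one argument.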

gemini-code-assist · 2026-06-29T07:26:48Z

+                if self._sandbox.is_running():
+                    await asyncio.to_thread(self._sandbox.kill)
+                    logger.info("YR sandbox %s killed", sandbox_id)
+                else:
+                    logger.info("YR sandbox %s already stopped", sandbox_id)


self._sandbox.is_running() is a synchronous blocking network call to the remote sandbox SDK. Calling it directly in an async def function blocks the event loop. Wrap it in asyncio.to_thread to prevent blocking the main thread.

Suggested change

-                if self._sandbox.is_running():
-                    await asyncio.to_thread(self._sandbox.kill)
-                    logger.info("YR sandbox %s killed", sandbox_id)
-                else:
-                    logger.info("YR sandbox %s already stopped", sandbox_id)
+                is_running = await asyncio.to_thread(self._sandbox.is_running)
+                if is_running:
+                    await asyncio.to_thread(self._sandbox.kill)
+                    logger.info("YR sandbox %s killed", sandbox_id)
+                else:
+                    logger.info("YR sandbox %s already stopped", sandbox_id)
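
The effect of `asyncio.to_thread` can be observed with a small experiment. In this sketch, `SlowClient` is a hypothetical stand-in for a synchronous SDK (the real sandbox SDK is not used), and a heartbeat task records ticks to show that the event loop keeps running while the blocking call is in flight.

```python
import asyncio
import time


class SlowClient:
    """Stand-in for a synchronous SDK whose calls block on network I/O."""

    def is_running(self) -> bool:
        time.sleep(0.3)  # simulates a blocking round trip
        return True


async def heartbeat(ticks: list) -> None:
    """Record a tick every 20 ms; this stalls if the event loop is blocked."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # yield once so the heartbeat records its first tick
    # The blocking call runs in a worker thread; the loop stays free for the heartbeat.
    running = await asyncio.to_thread(SlowClient().is_running)
    ticks_during_call = len(ticks)
    await hb
    return running, ticks_during_call


running, ticks_during_call = asyncio.run(main())
```

Calling `SlowClient().is_running()` directly inside `main` instead would hold the loop for the full 0.3 seconds, and the heartbeat would record only its first tick in that window.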

gemini-code-assist · 2026-06-29T07:26:49Z

+def extract_upstream(gateway_url: str) -> str:
+    """Extract host:port from a gateway URL for upstream tunnel config.
+
+    Example: "http://8.92.9.155:40169/sessions/abc/v1" -> "8.92.9.155:40169"
+    """
+    parsed = urlparse(gateway_url)
+    return f"{parsed.hostname}:{parsed.port}"


If the gateway_url does not specify an explicit port (e.g., standard http or https URLs), parsed.port will be None, resulting in an invalid upstream string like host:None. Handle missing ports by defaulting to 80 for http and 443 for https.

def extract_upstream(gateway_url: str) -> str:
    """Extract host:port from a gateway URL for upstream tunnel config.

    Example: "http://8.92.9.155:40169/sessions/abc/v1" -> "8.92.9.155:40169"
    """
    parsed = urlparse(gateway_url)
    hostname = parsed.hostname or ""
    port = parsed.port
    if port is None:
        port = 443 if parsed.scheme == "https" else 80
    return f"{hostname}:{port}"
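
A slightly stricter variant is sketched below as one possible alternative, not as the reviewer's or the author's code. It assumes that a URL without a host should be rejected rather than turned into a string like ":80", and it restores the brackets that `urlparse` strips from IPv6 literals so that host:port stays unambiguous. The `DEFAULT_PORTS` table is an assumption of this example.

```python
from urllib.parse import urlparse

# Assumed default ports per scheme; extend if other schemes are expected.
DEFAULT_PORTS = {"http": 80, "https": 443}


def extract_upstream(gateway_url: str) -> str:
    """Return host:port for a gateway URL, filling in the scheme's default port."""
    parsed = urlparse(gateway_url)
    if not parsed.hostname:
        raise ValueError(f"gateway URL has no host: {gateway_url!r}")
    port = parsed.port if parsed.port is not None else DEFAULT_PORTS.get(parsed.scheme)
    if port is None:
        raise ValueError(f"no port given and no default for scheme {parsed.scheme!r}")
    # urlparse strips the brackets from IPv6 literals; put them back.
    host = f"[{parsed.hostname}]" if ":" in parsed.hostname else parsed.hostname
    return f"{host}:{port}"


explicit = extract_upstream("http://8.92.9.155:40169/sessions/abc/v1")
https_default = extract_upstream("https://gateway.example.com/v1")
http_default = extract_upstream("http://gateway.example.com/v1")
ipv6 = extract_upstream("http://[::1]:8080/v1")
```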

gemini-code-assist · 2026-06-29T07:26:49Z

+    parsed = urlparse(gateway_url)
+    path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
+    return f"http://127.0.0.1:{proxy_port}{path}"


Rebuilding the gateway URL using only parsed.path discards any query parameters or fragments present in the original URL. Preserve them by appending parsed.query and parsed.fragment to the rewritten URL.

Suggested change

-    parsed = urlparse(gateway_url)
-    path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
-    return f"http://127.0.0.1:{proxy_port}{path}"
+    parsed = urlparse(gateway_url)
+    path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
+    query = f"?{parsed.query}" if parsed.query else ""
+    fragment = f"#{parsed.fragment}" if parsed.fragment else ""
+    return f"http://127.0.0.1:{proxy_port}{path}{query}{fragment}"
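
The same result can be reached without manual string assembly by replacing fields on the parsed URL and letting the standard library reassemble it. This is a sketch under assumptions: the function name `rewrite_gateway_url` and its signature are invented for the example, since the diff shows only the function body.

```python
from urllib.parse import urlparse


def rewrite_gateway_url(gateway_url: str, proxy_port: int, strip_v1: bool = False) -> str:
    """Point a gateway URL at a local proxy port, keeping path, query, and fragment."""
    parsed = urlparse(gateway_url)
    path = parsed.path.removesuffix("/v1") if strip_v1 else parsed.path
    # ParseResult is a namedtuple, so _replace swaps fields and geturl() reassembles them.
    rewritten = parsed._replace(scheme="http", netloc=f"127.0.0.1:{proxy_port}", path=path)
    return rewritten.geturl()


kept = rewrite_gateway_url("http://8.92.9.155:40169/sessions/abc/v1", 9000)
stripped = rewrite_gateway_url("http://8.92.9.155:40169/sessions/abc/v1", 9000, strip_v1=True)
with_query = rewrite_gateway_url(
    "https://gw.example.com/sessions/abc/v1?token=t1#frag", 9000, strip_v1=True
)
```

Note that `str.removesuffix` requires Python 3.9 or later, which the original code already assumes.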

gemini-code-assist · 2026-06-29T07:26:49Z

+    image = env_config.get("image")
+    if image:
+        return image
+    deployment = env_config.get("deployment")


If env_config is None or not a dictionary, calling env_config.get will raise an AttributeError. Add a type check to handle non-dictionary inputs gracefully.

    if not isinstance(env_config, dict):
        return ""
    image = env_config.get("image")
    if image:
        return image
    deployment = env_config.get("deployment")

gemini-code-assist · 2026-06-29T07:26:49Z

+        extra_info = row_dict.get("extra_info", {})
+        tools_kwargs = extra_info.get("tools_kwargs", {})
+        reward_config = tools_kwargs.get("reward", {})
+
+        row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
+        row_dict.setdefault("reward_model", {"ground_truth": {}})


If extra_info or tools_kwargs is missing or not a dictionary, calling .get on them will raise an AttributeError. Use defensive type checks to ensure robustness.

Suggested change

-        extra_info = row_dict.get("extra_info", {})
-        tools_kwargs = extra_info.get("tools_kwargs", {})
-        reward_config = tools_kwargs.get("reward", {})
-
-        row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
-        row_dict.setdefault("reward_model", {"ground_truth": {}})
+        extra_info = row_dict.get("extra_info") or {}
+        tools_kwargs = extra_info.get("tools_kwargs") or {} if isinstance(extra_info, dict) else {}
+        reward_config = tools_kwargs.get("reward") or {} if isinstance(tools_kwargs, dict) else {}
+
+        row_dict.setdefault("data_source", reward_config.get("name", "unknown") if isinstance(reward_config, dict) else "unknown")
+        row_dict.setdefault("reward_model", {"ground_truth": {}})
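
The repeated `isinstance` checks can also be folded into one small helper. The sketch below is a possible refactoring, not code from the PR: `get_nested_dict` and `fill_reward_fields` are hypothetical names, and the key names are taken from the diff above.

```python
from typing import Any


def get_nested_dict(data: Any, *keys: str) -> dict:
    """Walk nested dict keys, returning {} as soon as a level is missing or not a dict."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return {}
        current = current.get(key)
    return current if isinstance(current, dict) else {}


def fill_reward_fields(row_dict: dict) -> dict:
    """Inject default reward fields without assuming the shape of extra_info."""
    reward_config = get_nested_dict(row_dict, "extra_info", "tools_kwargs", "reward")
    row_dict.setdefault("data_source", reward_config.get("name", "unknown"))
    row_dict.setdefault("reward_model", {"ground_truth": {}})
    return row_dict


full = fill_reward_fields({"extra_info": {"tools_kwargs": {"reward": {"name": "swebench"}}}})
missing = fill_reward_fields({"extra_info": None})
wrong_type = fill_reward_fields({"extra_info": {"tools_kwargs": "oops"}})
preset = fill_reward_fields({"data_source": "preset"})
```

Because `setdefault` only writes absent keys, a row that already carries `data_source` keeps its value.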

gemini-code-assist · 2026-06-29T07:26:49Z

+    if extra_info and "reward_score" in extra_info:
+        score = float(extra_info["reward_score"])


If extra_info is not a dictionary (e.g., if it is None or another type), checking "reward_score" in extra_info can raise a TypeError. Ensure extra_info is a dictionary before performing the membership check.

Suggested change

-    if extra_info and "reward_score" in extra_info:
-        score = float(extra_info["reward_score"])
+    if isinstance(extra_info, dict) and "reward_score" in extra_info:
+        score = float(extra_info["reward_score"])

gemini-code-assist · 2026-06-29T07:26:49Z

+#
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"


Using dirname "$0" can be unreliable if the script is sourced or executed via certain shell interpreters. Using dirname "${BASH_SOURCE[0]}" is more robust and consistent with the other scripts in this repository (e.g., run_train.sh).

Suggested change

-SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

zhaizhiqiangA and others added 2 commits June 26, 2026 03:32

support mini-swe-agent and claud-code blockbox agent training recipes

00dd6e5

feat: add openyuanrong sandbox

a28540a

gemini-code-assist Bot reviewed Jun 29, 2026

zhaizhiqiangA force-pushed the blackbox-recipe branch 2 times, most recently from ded27bf to 042dc66 Compare June 29, 2026 12:44

wuxibin89 mentioned this pull request Jun 30, 2026

[rollout] feat: add optional per-wave rollout resource warmup/cleanup hooks verl-project/verl#6895

Closed

zhaizhiqiangA force-pushed the blackbox-recipe branch from 042dc66 to 9235bf6 Compare June 30, 2026 08:54

support mini-swe-agent blockbox agent training recipes

9235bf6

zhaizhiqiangA wants to merge 3 commits into verl-project:main from zhaizhiqiangA:blackbox-recipe

2 participants
