From 6b5f0568f7163861f92e1963ccfb5d34d204b4df Mon Sep 17 00:00:00 2001
From: Ammar
Date: Thu, 12 Feb 2026 09:17:22 -0600
Subject: [PATCH] bench: add HF upload safety guidance to tbench skill

- Always check for existing open PRs before creating new ones
  (uploads often timeout but succeed server-side)
- Push corrections to existing PRs via revision param, never re-create
- Close accidental duplicates immediately
- Do not coalesce runs into one job folder (validator checks job_id)
---
 .mux/skills/tbench/SKILL.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/.mux/skills/tbench/SKILL.md b/.mux/skills/tbench/SKILL.md
index 0edc915517..4280808652 100644
--- a/.mux/skills/tbench/SKILL.md
+++ b/.mux/skills/tbench/SKILL.md
@@ -225,12 +225,33 @@ api.upload_folder(

 The PR will be automatically validated by the leaderboard bot. Once merged, results appear on the leaderboard.

+**⚠️ CRITICAL: Do not spam the maintainer with duplicate PRs.**
+
+Uploads often timeout even when they succeed server-side. **Before retrying
+or creating a new PR**, always check for existing open PRs first:
+
+```python
+import requests
+resp = requests.get(
+    "https://huggingface.co/api/datasets/alexgshaw/terminal-bench-2-leaderboard/discussions",
+    params={"status": "open"},
+).json()
+for d in resp.get("discussions", []):
+    print(f'PR #{d["num"]}: {d["title"]} — {d["status"]}')
+```
+
+If a timed-out upload already created a PR, push corrections to that PR using
+`revision="refs/pr/"` — never call `create_pr=True` again for the same
+submission. If duplicate PRs are discovered, **stop and ask the User** which
+to keep/close before taking any action.
+
 **Tips from past submissions:**

 - The prepare script already strips `*.log` files (they trigger HF LFS and cause timeouts)
 - `--artifacts-dir` accepts raw job folders directly (e.g., an extracted tarball root)
 - To update an existing PR, pass `revision="refs/pr/"` instead of `create_pr=True`
 - To remove stale files from a PR, use `api.delete_folder(..., revision="refs/pr/")`
+- Do **not** coalesce multiple runs into a single job folder — the validator checks that each trial's `config.job_id` matches its parent job's `id`. Keep one job folder per run.

 ## Files