Trampoline-AI · magix022 · Jun 4, 2026
diff --git a/.agents/skills/predict-rlm-contributor/SKILL.md b/.agents/skills/predict-rlm-contributor/SKILL.md
@@ -0,0 +1,48 @@
+---
+name: predict-rlm-contributor
+description: >
+  Contribute to the predict-rlm repository itself: modify core PredictRLM runtime
+  code, RLM-GEPA internals, built-in skills, examples, docs, tests, packaging, or
+  repo-scoped agent skill guidance. Use when the user asks to change this repo or
+  investigate a bug in predict-rlm/RLM-GEPA. Do not use for building a new
+  downstream RLM package; use rlm for that, or rlm-gepa for downstream
+  optimization wiring.
+---
+
+# Contribute To predict-rlm
+
+Use this skill for repository work. Do not run the new-RLM scoping interview
+unless the user is explicitly asking to build a downstream RLM package.
+
+## Reference Map
+
+Read only what the task needs:
+
+- `references/repo-map.md`: major modules, examples, and verification commands.
+- `references/contributor-rules.md`: repo-specific coding, docs, and PR rules.
+- `references/gepa-internals.md`: RLM-GEPA contribution boundaries and proposer
+  behavior rules.
+
+## Workflow
+
+1. Inspect the requested change and relevant repo paths before editing.
+2. Preserve the distinction between downstream usage and repo contribution.
+3. Keep changes scoped to the module, docs, examples, or skill guidance in the
+   request.
+4. Validate at system boundaries. Prefer host-side tools for native libraries,
+   auth, network APIs, filesystem-heavy work, and anything that cannot run
+   cleanly in Pyodide.
+5. Run targeted tests or checks. Docs-only and skill-only changes need markdown
+   sanity plus `git diff --check`; code changes need focused tests, with broader
+   tests when touching shared runtime, sandbox execution, optimizer behavior, or
+   examples.
+
+## Issue And PR Rules
+
+Creating GitHub PRs/issues or pushing public branches is external publishing.
+Do it only when explicitly requested.
+
+When an investigation identifies a bug likely attributable to the
+`predict-rlm` package, ask the user, as soon as attribution is clear, whether
+they want it reported as a GitHub issue. Do not open the issue without explicit
+approval.
diff --git a/.agents/skills/predict-rlm-contributor/references/contributor-rules.md b/.agents/skills/predict-rlm-contributor/references/contributor-rules.md
@@ -0,0 +1,26 @@
+# Contributor Rules
+
+- PredictRLM is for callable, repeatable, deep-context workflows, not open-ended
+  interactive chat flows.
+- Keep large inputs as `File` references or metadata. Use focused `predict()`
+  calls and keep LLM-facing Pydantic schemas lean with `Field(description=...)`.
+- Validate at system boundaries. Let library validation raise when required
+  schema fields are missing; do not add silent fallbacks.
+- Keep generic runtime behavior domain-neutral. Domain or benchmark specifics
+  belong in examples, `AgentSpec`, seed/domain skills, runtime-grounding
+  examples, or evaluator feedback.
+- Persist experimental behavior in config, CLI options, or artifacts rather than
+  hidden env-only switches.
+- Use Conventional Commits. The allowed scopes are `rlm-gepa`, `predict-rlm`,
+  and `examples/[example-name]`.
+- PR descriptions must start with **Rationale**, followed by Summary and Test
+  Plan.
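+
+The scope rule above can be checked mechanically. The following is a minimal
+sketch, not repo tooling: `has_allowed_scope` is a hypothetical helper, the
+list of commit types is an assumption (the rules name only the scopes), and the
+sketch assumes a scope is always required.
+
+```python
+import re
+
+# Assumption: a common set of Conventional Commits types; the rules above
+# only specify the allowed scopes.
+TYPES = ("feat", "fix", "docs", "test", "refactor", "chore", "perf", "build", "ci")
+TYPE_ALT = "|".join(TYPES)
+
+# Scopes taken from the rule: rlm-gepa, predict-rlm, examples/[example-name].
+SCOPE = r"(?:rlm-gepa|predict-rlm|examples/[A-Za-z0-9._-]+)"
+
+SUBJECT_RE = re.compile(rf"^(?:{TYPE_ALT})\({SCOPE}\)!?: \S.*$")
+
+
+def has_allowed_scope(subject: str) -> bool:
+    """Return True when a commit subject uses an allowed type(scope) prefix."""
+    return SUBJECT_RE.match(subject) is not None
+```
+
+For example, `has_allowed_scope("fix(predict-rlm): handle empty File inputs")`
+passes, while a subject scoped as `fix(core): ...` is rejected.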
+
+## Skill Guidance Changes
+
+Keep each repo skill focused on one job. Use short trigger descriptions with
+clear boundaries. Put detailed API and workflow material in one-level
+`references/` files linked from `SKILL.md`.
+
+Do not put downstream RLM-building guidance and repository-contributor guidance
+in the same `SKILL.md`.
diff --git a/.agents/skills/predict-rlm-contributor/references/gepa-internals.md b/.agents/skills/predict-rlm-contributor/references/gepa-internals.md
@@ -0,0 +1,20 @@
+# RLM-GEPA Internals
+
+Use these rules when changing `src/rlm_gepa/` or its tests, examples, or docs.
+
+- Use `AgentSpec`, evaluator feedback, and seed instructions to steer
+  optimization direction. Keep runtime and budget knobs separate.
+- Derive signature and tool context from the constructed RLM with
+  `agent_spec_from_rlm(...)` where possible.
+- Avoid duplicating broad prose or exposing internal IDs unnecessarily.
+- Keep generic proposer behavior domain-neutral. Domain or benchmark specifics
+  belong in `AgentSpec`, seed/domain skills, runtime-grounding examples, or
+  evaluator feedback.
+- Patch-merge/crossover should be evidence-backed behavioral grafting from train
+  disagreement traces, not broad synthesis, prompt concatenation, or source text
+  import.
+- GEPA project wiring should live in downstream `gepa/` packages. Generic
+  optimizer orchestration belongs in `src/rlm_gepa/`.
+
+For verification, run targeted RLM-GEPA tests when touching optimizer schemas,
+runtime adapters, proposer behavior, reporting, or SpreadBench GEPA wiring.
diff --git a/.agents/skills/predict-rlm-contributor/references/repo-map.md b/.agents/skills/predict-rlm-contributor/references/repo-map.md
@@ -0,0 +1,49 @@
+# Repo Map
+
+`predict-rlm` extends DSPy's RLM with a built-in `predict()` tool. It has a
+two-level execution model:
+
+1. The outer LLM writes and executes Python in a sandbox.
+2. The sub-LM handles perception and extraction through `predict()` calls.
+
+## Key Modules
+
+- `src/predict_rlm/predict_rlm.py`: `PredictRLM`, `predict()` tool creation,
+  action/extract signatures, LM contexts, and file I/O orchestration.
+- `src/predict_rlm/backends/jspi/backend.py`: default Deno/Pyodide backend.
+- `src/predict_rlm/backends/sbx/backend.py`: Docker Sandboxes backend.
+- `src/predict_rlm/backends/supervisor/`: shared process supervision for
+  sandbox runners.
+- `src/predict_rlm/rlm_skills.py`: `Skill` dataclass and `merge_skills()`.
+- `src/predict_rlm/_shared.py`: action/extract signature construction and tool
+  doc formatting.
+- `src/predict_rlm/skills/`: built-in `pdf`, `spreadsheet`, and `docx` skills.
+- `src/rlm_gepa/`: RLM-GEPA optimizer integration.
+- `.agents/skills/`: repo-scoped agent skills for downstream users and
+  contributors.
+
+## Example Structure
+
+Examples generally follow:
+
+```text
+schema.py -> signature.py -> tools.py -> skills.py -> service.py -> run.py
+```
+
+Keep generated or example RLM packages grouped under `agent/`, and add
+`tools/`, `bench/`, and `gepa/` packages only when needed.
+
+## Common Commands
+
+```bash
+uv sync
+uv sync --extra examples
+make test-unit
+make test-integration
+uv run pytest tests/test_predict_rlm.py::TestPredictTool::test_name -v
+uv run ruff check src/ tests/
+git diff --check
+```
+
+Use targeted checks for narrow changes. Run broader suites when touching shared
+interfaces, sandbox execution, optimizer behavior, or examples.
diff --git a/.agents/skills/rlm-gepa/SKILL.md b/.agents/skills/rlm-gepa/SKILL.md
@@ -0,0 +1,97 @@
+---
+name: rlm-gepa
+description: >
+  Design, scaffold, and use RLM-GEPA optimization wiring for PredictRLM projects,
+  including AgentSpec scoping, train/validation data, scoring feedback, seed
+  candidates, GEPA project files, and optimize/eval CLI setup. Use when the user
+  asks for GEPA, prompt or skill optimization, candidate selection from RLM
+  traces, AgentSpec, RLMGepaProject, optimization metrics, or train/validation
+  split design. Do not use for modifying the predict-rlm repository internals;
+  use predict-rlm-contributor for that.
+---
+
+# RLM-GEPA Optimization
+
+RLM-GEPA optimizes reusable PredictRLM text components, usually skill
+instructions, from execution traces. A project defines the agent to run, the
+train/validation examples to evaluate, the scoring feedback, and an `AgentSpec`
+that tells the proposer what reusable behavior is in scope.
+
+Use this skill when optimization is in scope. If the user only wants a callable
+RLM with no GEPA wiring, use `rlm`. If the user is changing the `predict-rlm`
+repo implementation, use `predict-rlm-contributor`.
+
+## Reference Map
+
+Read only what the task needs:
+
+- `references/agent-spec.md`: `AgentSpec` scoping, `agent_spec_from_rlm(...)`,
+  component focus, and anti-duplication rules.
+- `references/data-and-scoring.md`: dataset audit, split hygiene, scoring
+  feedback, and overfitting boundaries.
+- `references/project-layout.md`: generated `gepa/` package shape, CLI wiring,
+  and verification commands.
+
+## Workflow
+
+### 1. Confirm The Optimization Target
+
+Identify the PredictRLM workflow that GEPA should improve. If the RLM does not
+exist yet, first scope the RLM enough to define its real DSPy signature, skills,
+tools, inputs, and outputs. Do not ask the user to hand-write
+`target_signature` or `tool_signatures`; derive them from the constructed RLM.
+
+### 2. Scope The GEPA Brief
+
+Interview only for context GEPA cannot infer:
+
+- product or optimization goal;
+- input distribution, scale, and representative examples;
+- output schema and important failure modes;
+- train/validation data source;
+- labels, references, or scoring rule;
+- partial-credit feedback and anti-overfitting boundary;
+- tools, sandbox facts, file conventions, and runtime constraints.
+
+If the user cannot answer everything, proceed with explicit assumptions and mark
+fields that must be revisited before spending model calls.
+
+### 3. Audit Data And Scoring
+
+Read `references/data-and-scoring.md` before writing split or scoring code.
+Inspect enough examples to identify task types, input sizes, label/reference
+shape, duplicates, leakage risks, missing labels, and failure buckets.
+
+Use train examples to propose and gate edits. Use validation examples for
+candidate selection and regression checks. Create a held-out test set only when
+the user asks for a benchmark/eval harness and the dataset size supports it.
+
+### 4. Design Components
+
+The most common component is `skill_instructions`, but multi-component projects
+can optimize several text blocks. `seed_candidate()` must return exactly the
+keys listed in `components`.
+
+Keep runtime and budget knobs out of the `AgentSpec`. Use `AgentSpec`, evaluator
+feedback, and seed instructions to steer optimization direction. Use CLI/config
+for `max_metric_calls`, minibatch size, concurrency, model choices, and runtime
+limits.
+
+### 5. Scaffold Project Wiring
+
+Create project-local `gepa/` files only when the user asks for optimization.
+The generated package owns task loading, metrics, seed candidate text, defaults,
+and CLI glue. The shared `rlm_gepa` package owns generic orchestration.
+
+Use `references/project-layout.md` for files and imports. Add the GEPA package
+extra and the `rlm-gepa` console script to `pyproject.toml` when scaffolding a
+full project.
+
+### 6. Verify Before Running Optimization
+
+Add fast checks that load train/validation data, construct the project, verify
+the seed candidate keys, and build the target RLM without running a costly
+optimization.
+
+Run `uv run rlm-gepa optimize --check` when the project CLI exists. For docs-only
+or scaffolding changes, also run markdown sanity checks and `git diff --check`.
diff --git a/.agents/skills/rlm-gepa/references/agent-spec.md b/.agents/skills/rlm-gepa/references/agent-spec.md
@@ -0,0 +1,61 @@
+# AgentSpec
+
+Prefer `agent_spec_from_rlm(...)` for new projects. The RLM stays the source of
+truth for the DSPy signature, output schema, skills, and tools.
+
+```python
+from rlm_gepa import agent_spec_from_rlm
+
+agent_spec = agent_spec_from_rlm(
+    build_rlm(SEED_SKILL_INSTRUCTIONS),  # project-defined builder and seed text
+    use_cases=[
+        "contract review with clause-level citations",
+        "invoice analysis with total reconciliation",
+    ],
+    runtime_grounding_examples={
+        "skills": ["document-analysis skill instructions are optimized"],
+        "sandbox facts": ["Pyodide filesystem paths and package limits"],
+    },
+    scoring_description=(
+        "Score combines answer correctness and citation support. Feedback names "
+        "missing findings, unsupported citations, and extraction errors."
+    ),
+)
+```
+
+Do not duplicate facts `agent_spec_from_rlm(...)` can derive. Add only context
+GEPA cannot infer:
+
+- transfer use cases beyond the benchmark;
+- runtime-grounding examples the proposer must preserve;
+- scoring signal and evaluator feedback shape;
+- anti-overfitting boundaries;
+- short product or optimization framing, only when it adds useful context.
+
+Omit `agent_type` by default. Set it only when a concise product or optimization
+anchor adds information not already present in the signature, tools, or output
+schema.
+
+## Components
+
+`components` names mutable text fields. `seed_candidate()` must return exactly
+those keys.
+
+```python
+class MyProject(RLMGepaProject):
+    components = ("skill_instructions",)
+
+    def seed_candidate(self) -> dict[str, str]:
+        return {"skill_instructions": SEED_SKILL_INSTRUCTIONS}
+```
+
+Override `component_focus(component_name)` when each component needs a different
+proposer brief. Keep component names stable so runs and candidate artifacts are
+comparable.
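+
+A fast pre-optimization check for the exact-keys rule could look like the
+sketch below. `check_seed_keys` is a hypothetical helper, not part of
+`rlm_gepa`; it takes plain values so it runs without the package installed.
+
+```python
+def check_seed_keys(components: tuple[str, ...], seed: dict[str, str]) -> None:
+    """Raise ValueError unless the seed has exactly the component keys."""
+    expected, actual = set(components), set(seed)
+    missing = sorted(expected - actual)  # declared components with no seed text
+    extra = sorted(actual - expected)    # seed keys that are not components
+    if missing or extra:
+        raise ValueError(
+            f"seed candidate keys mismatch: missing={missing}, extra={extra}"
+        )
+
+
+# Hypothetical usage with the values a project would expose.
+check_seed_keys(("skill_instructions",), {"skill_instructions": "seed text"})
+```
+
+In a project test, the two arguments would presumably come from the project's
+`components` attribute and its `seed_candidate()` result.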
+
+## Proposer Boundaries
+
+Patch-merge/crossover should be evidence-backed behavioral grafting from train
+disagreement traces. Avoid broad synthesis, prompt concatenation, source text
+imports, or benchmark-specific hacks. Domain specifics belong in `AgentSpec`,
+seed/domain skills, runtime-grounding examples, or evaluator feedback.
diff --git a/.agents/skills/rlm-gepa/references/data-and-scoring.md b/.agents/skills/rlm-gepa/references/data-and-scoring.md
@@ -0,0 +1,56 @@
+# Data And Scoring
+
+Investigate the dataset before writing split or scoring code. Do not treat it
+as an opaque list of rows.
+
+Inspect enough examples to identify:
+
+- task types and input sizes;
+- label or reference-output shape;
+- duplicate or near-duplicate examples;
+- missing labels or ambiguous references;
+- source grouping keys such as document, user, customer, or task family;
+- failure buckets the scorer should expose.
+
+## Split Semantics
+
+Use split names consistently:
+
+- **Train**: examples the optimizer/proposer may use to generate and gate edits.
+- **Validation**: examples used for candidate selection and regression checks.
+- **Test / held-out eval**: optional final reporting set.
+
+Prefer deterministic splits. Put random seed, split ratio/counts, grouping key,
+and sampling limits in `bench/config.py` or `gepa/config.py`. Split by group when
+leakage is plausible. Never let near-identical examples from the same source
+land in both train and validation without calling it out.
+
+If the dataset is tiny, prefer explicit hand-authored train/validation files
+over random splitting.
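+
+The split rules above can be sketched as a hash-based, group-aware assignment.
+This is a minimal sketch under stated assumptions: `assign_split`,
+`split_by_group`, the `source` field, and both constants are hypothetical
+names, and the constants stand in for values kept in `bench/config.py` or
+`gepa/config.py`.
+
+```python
+import hashlib
+
+SPLIT_SEED = "v1"        # hypothetical config value
+VALIDATION_RATIO = 0.25  # hypothetical config value
+
+
+def assign_split(group_key: str, seed: str = SPLIT_SEED,
+                 validation_ratio: float = VALIDATION_RATIO) -> str:
+    """Map a group key to 'train' or 'validation' deterministically."""
+    digest = hashlib.sha256(f"{seed}:{group_key}".encode("utf-8")).digest()
+    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
+    return "validation" if bucket < validation_ratio else "train"
+
+
+def split_by_group(examples: list[dict],
+                   group_field: str = "source") -> dict[str, list[dict]]:
+    """Split examples so each group lands in exactly one split."""
+    splits: dict[str, list[dict]] = {"train": [], "validation": []}
+    for example in examples:
+        splits[assign_split(str(example[group_field]))].append(example)
+    return splits
+```
+
+Hashing the group key rather than the row index keeps every example from one
+document, user, or task family on the same side of the split, and the result
+does not depend on row order.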
+
+## Scoring Feedback
+
+Each `evaluate_example()` call should return a scalar score plus feedback that
+helps the proposer make a targeted behavioral change.
+
+Good feedback names concrete misses:
+
+- missing fields;
+- unsupported citations;
+- extraction or parsing errors;
+- wrong calculations;
+- formatting or file-output failures;
+- tool-use mistakes visible in traces.
+
+Avoid feedback that only says "wrong" or restates the score. GEPA quality is
+bounded by the evidence the metric returns.
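+
+As an illustration of feedback that names concrete misses, here is a minimal
+field-comparison sketch. `score_fields` is a hypothetical standalone helper;
+the actual `evaluate_example()` signature and return type are not shown here,
+so the `(score, feedback)` tuple is an assumption.
+
+```python
+def score_fields(predicted: dict, expected: dict) -> tuple[float, str]:
+    """Return a score in [0, 1] plus feedback naming each concrete miss."""
+    if not expected:
+        return 1.0, "No reference fields to check."
+    misses = []
+    correct = 0
+    for field, want in expected.items():
+        if field not in predicted:
+            misses.append(f"missing field '{field}'")
+        elif predicted[field] != want:
+            misses.append(
+                f"wrong value for '{field}': got {predicted[field]!r}, "
+                f"expected {want!r}"
+            )
+        else:
+            correct += 1
+    score = correct / len(expected)  # fraction of reference fields matched
+    feedback = "; ".join(misses) if misses else "All reference fields matched."
+    return score, feedback
+```
+
+A real scorer would add task-specific checks such as citation support or
+calculation errors; the point is that each deduction is paired with a named
+cause.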
+
+## Overfitting Boundaries
+
+State what counts as a transferable improvement versus a benchmark-specific
+hack. Examples:
+
+- preserve citation grounding instead of memorizing answer strings;
+- improve table handling generally instead of keying on fixture names;
+- preserve sandbox path conventions and tool APIs;
+- prefer behavior that transfers across document lengths and layouts.