48 changes: 48 additions & 0 deletions .agents/skills/predict-rlm-contributor/SKILL.md
@@ -0,0 +1,48 @@
---
name: predict-rlm-contributor
description: >
  Contribute to the predict-rlm repository itself: modify core PredictRLM runtime
  code, RLM-GEPA internals, built-in skills, examples, docs, tests, packaging, or
  repo-scoped agent skill guidance. Use when the user asks to change this repo or
  investigate a bug in predict-rlm/RLM-GEPA. Do not use for building a new
  downstream RLM package; use rlm for that, or rlm-gepa for downstream
  optimization wiring.
---

# Contribute To predict-rlm

Use this skill for repository work. Do not run the new-RLM scoping interview
unless the user is explicitly asking to build a downstream RLM package.

## Reference Map

Read only what the task needs:

- `references/repo-map.md`: major modules, examples, and verification commands.
- `references/contributor-rules.md`: repo-specific coding, docs, and PR rules.
- `references/gepa-internals.md`: RLM-GEPA contribution boundaries and proposer
behavior rules.

## Workflow

1. Inspect the requested change and relevant repo paths before editing.
2. Preserve the distinction between downstream usage and repo contribution.
3. Keep changes scoped to the module, docs, examples, or skill guidance in the
request.
4. Validate at system boundaries. Prefer host-side tools for native libraries,
auth, network APIs, filesystem-heavy work, and anything that cannot run
cleanly in Pyodide.
5. Run targeted tests or checks. Docs-only and skill-only changes need markdown
sanity plus `git diff --check`; code changes need focused tests, with broader
tests when touching shared runtime, sandbox execution, optimizer behavior, or
examples.

## Issue And PR Rules

Creating GitHub PRs/issues or pushing public branches is external publishing.
Do it only when explicitly requested.

When an investigation identifies a bug likely attributable to the
`predict-rlm` package, ask whether the user wants it reported as a GitHub issue
as soon as attribution is clear. Do not open the issue without explicit
approval.
@@ -0,0 +1,26 @@
# Contributor Rules

- PredictRLM is for callable, repeatable, deep-context workflows, not open-ended
interactive chat flows.
- Keep large inputs as `File` references or metadata. Use focused `predict()`
calls and keep LLM-facing Pydantic schemas lean with `Field(description=...)`.
- Validate at system boundaries. Let library validation raise when schema fields
are required; do not add silent fallbacks.
- Keep generic runtime behavior domain-neutral. Domain or benchmark specifics
belong in examples, `AgentSpec`, seed/domain skills, runtime-grounding
examples, or evaluator feedback.
- Persist experimental behavior in config, CLI options, or artifacts rather than
hidden env-only switches.
- Use Conventional Commits. The allowed scopes are `rlm-gepa`, `predict-rlm`,
and `examples/[example-name]`.
- PR descriptions must start with **Rationale**, followed by Summary and Test
Plan.
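
The commit-scope rule above can be checked mechanically. The following is a minimal sketch, not part of the repo's tooling: `check_commit_subject` and the type list in `ASSUMED_TYPES` are hypothetical, and the sketch assumes a scope is always required.

```python
import re

# Scopes named in the rule above; "examples/<name>" stands for any example directory.
ALLOWED_SCOPE = re.compile(r"^(rlm-gepa|predict-rlm|examples/[A-Za-z0-9._-]+)$")

# Assumption: a commonly used Conventional Commits type list; the repo may differ.
ASSUMED_TYPES = ("feat", "fix", "docs", "test", "refactor", "chore", "build", "ci", "perf")

# Shape: type(scope): summary, with an optional "!" marking a breaking change.
SUBJECT = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[^)]+)\)!?: (?P<summary>\S.*)$")


def check_commit_subject(subject: str) -> list[str]:
    """Return a list of problems; an empty list means the subject passes."""
    match = SUBJECT.match(subject)
    if match is None:
        return ["subject must look like 'type(scope): summary'"]
    problems = []
    if match["type"] not in ASSUMED_TYPES:
        problems.append(f"unknown type: {match['type']}")
    if not ALLOWED_SCOPE.match(match["scope"]):
        problems.append(f"scope not allowed: {match['scope']}")
    return problems
```

For example, `check_commit_subject("fix(predict-rlm): handle empty file list")` returns an empty list, while a subject with the scope `core` is reported as not allowed.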

## Skill Guidance Changes

Keep each repo skill focused on one job. Use short trigger descriptions with
clear boundaries. Put detailed API and workflow material in one-level
`references/` files linked from `SKILL.md`.

Do not put downstream RLM-building guidance and repository-contributor guidance
in the same `SKILL.md`.
@@ -0,0 +1,20 @@
# RLM-GEPA Internals

Use these rules when changing `src/rlm_gepa/`, tests, examples, or docs.

- Treat `AgentSpec`, evaluator feedback, and seed instructions as the
optimization direction. Keep runtime and budget knobs separate.
- Derive signature and tool context from the constructed RLM with
`agent_spec_from_rlm(...)` where possible.
- Avoid duplicating broad prose or exposing internal IDs unnecessarily.
- Keep generic proposer behavior domain-neutral. Domain or benchmark specifics
belong in `AgentSpec`, seed/domain skills, runtime-grounding examples, or
evaluator feedback.
- Patch-merge/crossover should be evidence-backed behavioral grafting from train
disagreement traces, not broad synthesis, prompt concatenation, or source text
import.
- GEPA project wiring should live in downstream `gepa/` packages. Generic
optimizer orchestration belongs in `src/rlm_gepa/`.

For verification, run targeted RLM-GEPA tests when touching optimizer schemas,
runtime adapters, proposer behavior, reporting, or SpreadBench GEPA wiring.
49 changes: 49 additions & 0 deletions .agents/skills/predict-rlm-contributor/references/repo-map.md
@@ -0,0 +1,49 @@
# Repo Map

`predict-rlm` extends DSPy's RLM with a built-in `predict()` tool. It has a
two-level execution model:

1. The outer LLM writes and executes Python in a sandbox.
2. The sub-LM handles perception and extraction through `predict()` calls.

## Key Modules

- `src/predict_rlm/predict_rlm.py`: `PredictRLM`, `predict()` tool creation,
action/extract signatures, LM contexts, and file I/O orchestration.
- `src/predict_rlm/backends/jspi/backend.py`: default Deno/Pyodide backend.
- `src/predict_rlm/backends/sbx/backend.py`: Docker Sandboxes backend.
- `src/predict_rlm/backends/supervisor/`: shared sandbox runner process
supervision.
- `src/predict_rlm/rlm_skills.py`: `Skill` dataclass and `merge_skills()`.
- `src/predict_rlm/_shared.py`: action/extract signature construction and tool
doc formatting.
- `src/predict_rlm/skills/`: built-in `pdf`, `spreadsheet`, and `docx` skills.
- `src/rlm_gepa/`: RLM-GEPA optimizer integration.
- `.agents/skills/`: repo-scoped agent skills for downstream users and
contributors.

## Example Structure

Examples generally follow:

```text
schema.py -> signature.py -> tools.py -> skills.py -> service.py -> run.py
```

Keep generated or example RLM packages grouped under `agent/`, with optional
`tools/`, `bench/`, and `gepa/` packages only when needed.

## Common Commands

```bash
uv sync
uv sync --extra examples
make test-unit
make test-integration
uv run pytest tests/test_predict_rlm.py::TestPredictTool::test_name -v
uv run ruff check src/ tests/
git diff --check
```

Use targeted checks for narrow changes. Run broader suites when touching shared
interfaces, sandbox execution, optimizer behavior, or examples.
97 changes: 97 additions & 0 deletions .agents/skills/rlm-gepa/SKILL.md
@@ -0,0 +1,97 @@
---
name: rlm-gepa
description: >
  Design, scaffold, and use RLM-GEPA optimization wiring for PredictRLM projects,
  including AgentSpec scoping, train/validation data, scoring feedback, seed
  candidates, GEPA project files, and optimize/eval CLI setup. Use when the user
  asks for GEPA, prompt or skill optimization, candidate selection from RLM
  traces, AgentSpec, RLMGepaProject, optimization metrics, or train/validation
  split design. Do not use for modifying the predict-rlm repository internals;
  use predict-rlm-contributor for that.
---

# RLM-GEPA Optimization

RLM-GEPA optimizes reusable PredictRLM text components, usually skill
instructions, from execution traces. A project defines the agent to run, the
train/validation examples to evaluate, the scoring feedback, and an `AgentSpec`
that tells the proposer what reusable behavior is in scope.

Use this skill when optimization is in scope. If the user only wants a callable
RLM with no GEPA wiring, use `rlm`. If the user is changing the `predict-rlm`
repo implementation, use `predict-rlm-contributor`.

## Reference Map

Read only what the task needs:

- `references/agent-spec.md`: `AgentSpec` scoping, `agent_spec_from_rlm(...)`,
component focus, and anti-duplication rules.
- `references/data-and-scoring.md`: dataset audit, split hygiene, scoring
feedback, and overfitting boundaries.
- `references/project-layout.md`: generated `gepa/` package shape, CLI wiring,
and verification commands.

## Workflow

### 1. Confirm The Optimization Target

Identify the PredictRLM workflow that GEPA should improve. If the RLM does not
exist yet, first scope the RLM enough to define its real DSPy signature, skills,
tools, inputs, and outputs. Do not ask the user to hand-write
`target_signature` or `tool_signatures`; derive them from the constructed RLM.

### 2. Scope The GEPA Brief

Interview only for context GEPA cannot infer:

- product or optimization goal;
- input distribution, scale, and representative examples;
- output schema and important failure modes;
- train/validation data source;
- labels, references, or scoring rule;
- partial-credit feedback and anti-overfitting boundary;
- tools, sandbox facts, file conventions, and runtime constraints.

If the user cannot answer everything, proceed with explicit assumptions and mark
fields that must be revisited before spending model calls.

### 3. Audit Data And Scoring

Read `references/data-and-scoring.md` before writing split or scoring code.
Inspect examples enough to identify task types, input sizes, labels/reference
shape, duplicates, leakage risks, missing labels, and failure buckets.

Use train examples to propose and gate edits. Use validation examples for
candidate selection and regression checks. Create a held-out test set only when
the user asks for a benchmark/eval harness and the dataset size supports it.

### 4. Design Components

The most common component is `skill_instructions`, but multi-component projects
can optimize several text blocks. `seed_candidate()` must return exactly the
keys listed in `components`.

Keep runtime and budget knobs out of the `AgentSpec`. Use `AgentSpec`, evaluator
feedback, and seed instructions to steer optimization direction. Use CLI/config
for `max_metric_calls`, minibatch size, concurrency, model choices, and runtime
limits.

### 5. Scaffold Project Wiring

Create project-local `gepa/` files only when the user asks for optimization.
The generated package owns task loading, metrics, seed candidate text, defaults,
and CLI glue. The shared `rlm_gepa` package owns generic orchestration.

Use `references/project-layout.md` for files and imports. Add the GEPA package
extra and `rlm-gepa` console script in `pyproject.toml` when scaffolding a full
project.

### 6. Verify Before Running Optimization

Add fast checks that load train/validation data, construct the project, verify
the seed candidate keys, and build the target RLM without running a costly
optimization.

Run `uv run rlm-gepa optimize --check` when the project CLI exists. For docs-only
or scaffolding changes, also run markdown sanity checks and `git diff --check`.
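
One of the fast checks above, verifying the seed candidate keys, needs no model calls. A minimal sketch follows; `seed_key_problems` and `StandInProject` are hypothetical names used only for illustration, and a real project would subclass `RLMGepaProject` instead of the stand-in.

```python
def seed_key_problems(components, seed):
    """Compare declared component names with the keys of a seed candidate."""
    declared, returned = set(components), set(seed)
    problems = [f"missing seed text for: {name}" for name in sorted(declared - returned)]
    problems += [f"undeclared seed key: {name}" for name in sorted(returned - declared)]
    return problems


class StandInProject:
    """Hypothetical stand-in with the same two members the check reads."""

    components = ("skill_instructions",)

    def seed_candidate(self):
        return {"skill_instructions": "Cite the clause for every finding."}


project = StandInProject()
# An empty list means the seed keys match the declared components exactly.
preflight = seed_key_problems(project.components, project.seed_candidate())
```

Running this in a unit test keeps a key mismatch from surfacing only after an optimization run has started.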
61 changes: 61 additions & 0 deletions .agents/skills/rlm-gepa/references/agent-spec.md
@@ -0,0 +1,61 @@
# AgentSpec

Prefer `agent_spec_from_rlm(...)` for new projects. The RLM stays the source of
truth for the DSPy signature, output schema, skills, and tools.

```python
from rlm_gepa import agent_spec_from_rlm

agent_spec = agent_spec_from_rlm(
    build_rlm(SEED_SKILL_INSTRUCTIONS),
    use_cases=[
        "contract review with clause-level citations",
        "invoice analysis with total reconciliation",
    ],
    runtime_grounding_examples={
        "skills": ["document-analysis skill instructions are optimized"],
        "sandbox facts": ["Pyodide filesystem paths and package limits"],
    },
    scoring_description=(
        "Score combines answer correctness and citation support. Feedback names "
        "missing findings, unsupported citations, and extraction errors."
    ),
)
```

Do not duplicate facts `agent_spec_from_rlm(...)` can derive. Add only context
GEPA cannot infer:

- transfer use cases beyond the benchmark;
- runtime-grounding examples the proposer must preserve;
- scoring signal and evaluator feedback shape;
- anti-overfitting boundaries;
- short product or optimization framing, only when it adds useful context.

Omit `agent_type` by default. Set it only when a concise product or optimization
anchor adds information not already present in the signature, tools, or output
schema.

## Components

`components` names mutable text fields. `seed_candidate()` must return exactly
those keys.

```python
class MyProject(RLMGepaProject):
    components = ("skill_instructions",)

    def seed_candidate(self) -> dict[str, str]:
        return {"skill_instructions": SEED_SKILL_INSTRUCTIONS}
```

Override `component_focus(component_name)` when each component needs a different
proposer brief. Keep component names stable so runs and candidate artifacts are
comparable.

## Proposer Boundaries

Patch-merge/crossover should be evidence-backed behavioral grafting from train
disagreement traces. Avoid broad synthesis, prompt concatenation, source text
imports, or benchmark-specific hacks. Domain specifics belong in `AgentSpec`,
seed/domain skills, runtime-grounding examples, or evaluator feedback.
56 changes: 56 additions & 0 deletions .agents/skills/rlm-gepa/references/data-and-scoring.md
@@ -0,0 +1,56 @@
# Data And Scoring

Investigate the dataset before writing split or scoring code. Do not treat it
as an opaque list of rows.

Inspect enough examples to identify:

- task types and input sizes;
- label or reference-output shape;
- duplicate or near-duplicate examples;
- missing labels or ambiguous references;
- source grouping keys such as document, user, customer, or task family;
- failure buckets the scorer should expose.
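
A first pass over the checklist above can be scripted. This is a rough sketch under stated assumptions: rows are plain dicts, and the `input`, `label`, and `document` keys are hypothetical field names. It catches only duplicates that match after case and whitespace normalization; true near-duplicates need a fuzzier comparison.

```python
from collections import Counter


def audit_examples(examples, group_key="document"):
    """Summarize duplicates, missing labels, and group sizes for a list of dict rows."""
    # Normalize case and whitespace so trivially different copies collapse together.
    normalized = Counter(
        " ".join(str(row.get("input", "")).lower().split()) for row in examples
    )
    return {
        "total": len(examples),
        "duplicate_inputs": sum(count - 1 for count in normalized.values() if count > 1),
        "missing_labels": sum(1 for row in examples if not row.get("label")),
        "groups": dict(Counter(row.get(group_key) for row in examples)),
    }


rows = [
    {"input": "Total due?", "label": "42", "document": "a.pdf"},
    {"input": "total  due?", "label": "", "document": "a.pdf"},
    {"input": "Who signed?", "label": "Ana", "document": "b.pdf"},
]
report = audit_examples(rows)
```

The group counts show which sources dominate the dataset, which helps when deciding whether to split by group.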

## Split Semantics

Use split names consistently:

- **Train**: examples the optimizer/proposer may use to generate and gate edits.
- **Validation**: examples used for candidate selection and regression checks.
- **Test / held-out eval**: optional final reporting set.

Prefer deterministic splits. Put random seed, split ratio/counts, grouping key,
and sampling limits in `bench/config.py` or `gepa/config.py`. Split by group when
leakage is plausible. Never let near-identical examples from the same source
land in both train and validation without calling it out.

If the dataset is tiny, prefer explicit hand-authored train/validation files
over random splitting.
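
A deterministic, group-aware split can be sketched with a stable hash, so the same group always lands in the same split regardless of row order. The function names, the `document` grouping key, and the default fraction are assumptions for illustration; in a project the seed and fraction would come from the config file.

```python
import hashlib


def assign_split(group_id: str, seed: int = 0, validation_fraction: float = 0.2) -> str:
    """Map a group to 'train' or 'validation' from a stable hash of seed and group id."""
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    # First 4 bytes as an integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "validation" if bucket < validation_fraction else "train"


def split_by_group(examples, group_key="document", seed=0, validation_fraction=0.2):
    """Place every row of a group in the same split to limit leakage."""
    splits = {"train": [], "validation": []}
    for row in examples:
        split = assign_split(str(row[group_key]), seed, validation_fraction)
        splits[split].append(row)
    return splits
```

Because assignment is hash-based, the realized validation share only approximates `validation_fraction`, and the approximation is coarse when there are few groups. That is one more reason to hand-author the files for tiny datasets.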

## Scoring Feedback

Each `evaluate_example()` should return a scalar score plus feedback that helps
the proposer make a targeted behavioral change.

Good feedback names concrete misses:

- missing fields;
- unsupported citations;
- extraction or parsing errors;
- wrong calculations;
- formatting or file-output failures;
- tool-use mistakes visible in traces.

Avoid feedback that only says "wrong" or restates the score. GEPA quality is
bounded by the evidence the metric returns.
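
As an illustration of feedback that names concrete misses, here is a minimal field-level scorer. `score_fields` is a hypothetical helper, and the `(score, feedback)` tuple is an assumed shape; the actual return type expected from `evaluate_example()` may differ.

```python
def score_fields(expected: dict, predicted: dict) -> tuple[float, str]:
    """Return the fraction of expected fields matched, plus feedback naming each miss."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    misses = []
    for name, want in expected.items():
        if name not in predicted:
            misses.append(f"missing field '{name}'")
        elif predicted[name] != want:
            misses.append(f"wrong value for '{name}' (got {predicted[name]!r})")
    score = 1.0 - len(misses) / len(expected)
    feedback = "; ".join(misses) if misses else "all expected fields matched"
    return score, feedback
```

The feedback reports what the agent produced but not the reference value. Whether to reveal references to the proposer is a per-project decision tied to the overfitting boundary.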

## Overfitting Boundaries

State what counts as a transferable improvement versus a benchmark-specific
hack. Examples:

- preserve citation grounding instead of memorizing answer strings;
- improve table handling generally instead of keying on fixture names;
- preserve sandbox path conventions and tool APIs;
- prefer behavior that transfers across document lengths and layouts.