A standalone CLI for running autonomous improvement experiments on any measurable artifact.
Based on Karpathy's AutoResearch pattern, generalized for any domain.
```sh
# From source
cargo install --path .

# Or download a prebuilt binary
curl -L https://github.com/hmbldv/sia/releases/latest/download/sia-linux-x86_64 -o ~/.local/bin/sia
chmod +x ~/.local/bin/sia
```
```sh
# Improve a prompt file, measuring test success rate
sia run \
  --target prompt.md \
  --evaluate "python test_prompt.py 2>&1" \
  --metric "success_rate: ([\d.]+)" \
  --direction maximize \
  --max-iterations 20
```
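The `--metric` regex is applied to the evaluation command's output, and its first capture group is parsed as the score. Any command works as long as it prints a matching line; below is a minimal sketch of an evaluation script for the example above (the counting logic and numbers are placeholders, not part of sia):

```sh
#!/usr/bin/env bash
# Hypothetical evaluation script: run your checks, then print one line that
# the --metric regex "success_rate: ([\d.]+)" can capture.
passed=17   # placeholder: count passing cases however your harness does
total=20
awk -v p="$passed" -v t="$total" 'BEGIN { printf "success_rate: %.2f\n", p / t }'
```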
```sh
# Improve code, measuring test pass count
sia run \
  --target src/lib.rs \
  --evaluate "cargo test 2>&1 | tail -1" \
  --metric "(\d+) passed" \
  --direction maximize \
  --timeout 1h

# Dry run (show plan without executing)
sia plan --target config.yaml --goal "reduce memory usage"

# Resume interrupted session
sia resume --session abc123

# View experiment history
sia history --last 10
```

sia includes a guard layer that runs before every proposed change is applied. This is a first-class feature, not an afterthought.
What the guard blocks:

- Forbidden paths: writes to `/.ssh`, `/.aws`, `/.env`, and other sensitive directories are rejected outright
- Secret pattern detection: API keys, private keys, and JWTs in proposed changes are caught before they reach disk (see the sketch after this list)
- Dangerous commands: `rm -rf`, `curl | sh`, `sudo`, and similar patterns are flagged and blocked
- Metacharacter injection: token-level rejection of shell metacharacters in LLM-generated content
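As a rough illustration of the kind of check the secret-pattern guard performs (this is not sia's implementation, and the patch filename is just a stand-in):

```sh
# Illustration only, not sia's actual guard code: scan a proposed change for
# secret-shaped strings (AWS key IDs, PEM private keys, JWTs) before it is
# allowed to touch disk.
if grep -Eq 'AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----|eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+' proposed_change.patch
then
    echo "guard: secret-like pattern found, change rejected" >&2
    exit 1
fi
```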
Checkpointing: every change, whether kept or reverted, is committed to git (or backed up if the target is not in a git repo) before it is applied. Full rollback is always available via `sia rollback --session <id>`.
```text
LOOP until (max_iterations OR timeout OR stop_signal):
  1. Read current state (target files, experiment history)
  2. Generate hypothesis (LLM proposes change)
  3. Checkpoint (git commit)
  4. Apply change to target
  5. Run evaluation (external command → numeric metric)
  6. Compare to baseline:
     - IF improved → keep change, update baseline
     - ELSE → revert to checkpoint
  7. Log result to experiment history
```
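Steps 5 and 6 amount to a regex capture over the evaluator's output plus a numeric comparison against the best score so far. Roughly, as an illustration of the control flow only (this is not sia's code; the script name and revert command are stand-ins):

```sh
# Illustration of steps 5-6 only, not sia's implementation.
score=$(./evaluate.sh 2>&1 | grep -oE 'score: [0-9.]+' | grep -oE '[0-9.]+')

# "Improved" means strictly better in the configured direction (maximize here).
if awk -v s="$score" -v b="$baseline" 'BEGIN { exit !(s > b) }'; then
    baseline="$score"          # keep the change, raise the bar
else
    git checkout .             # revert the working tree to the checkpoint
fi
```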
```yaml
# sia.yaml
target: src/prompt.md
evaluate: "./measure.sh"
metric: "score: ([\d.]+)"
direction: maximize

limits:
  max_iterations: 50
  timeout_seconds: 3600
  max_tokens_per_iteration: 4096
  max_eval_seconds: 300

llm:
  provider: anthropic   # anthropic | openai | ollama
  model: claude-sonnet-4
  # api_key read from ANTHROPIC_API_KEY env var

goal: "Improve the prompt to get higher task completion rates"
constraints:
  - "Keep the prompt under 2000 tokens"
  - "Maintain the existing structure"
```

| Provider | Config | API Key Env |
|---|---|---|
| Anthropic | `provider: anthropic` | `ANTHROPIC_API_KEY` |
| OpenAI | `provider: openai` | `OPENAI_API_KEY` |
| Ollama | `provider: ollama`, `base_url: http://localhost:11434` | — |
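Keys for hosted providers come from the environment, never from `sia.yaml`; Ollama needs none. For example (the key value is a placeholder):

```sh
# Set before running sia; use OPENAI_API_KEY instead for provider: openai.
export ANTHROPIC_API_KEY="<your-key>"
```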
sia runs standalone. It can be wired to any scheduler or event system via its exit codes and structured JSON output — a non-zero exit means no improvement was found this iteration.
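For example, a nightly cron job (or any CI step) can branch on the exit code. The wrapper below is a sketch; the notification command and the redirect of the JSON report to a file are assumptions about your environment, not anything sia mandates:

```sh
#!/usr/bin/env bash
# Hypothetical scheduler hook: exit 0 means sia kept an improvement,
# non-zero means nothing beat the baseline this run.
if sia run \
     --target prompt.md \
     --evaluate "./measure.sh" \
     --metric "score: ([\d.]+)" \
     --direction maximize \
     --max-iterations 20 > sia-report.json; then
    notify-send "sia: improvement found"   # swap for mail, a Slack webhook, etc.
fi
```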
- Every change is checkpointed (git commit or file backup)
- Full rollback always possible (`sia rollback --session <id>`)
- Resource limits enforced (iterations, time, tokens, eval time)
- No network access during evaluation by default
MIT