Shelldweller is sixteen lines of shell. bin/llm exposes a language model as a Unix command — pipe a prompt in, get a response out. bin/shelldweller sends a hint and a task to the model, then pipes whatever the model produces directly to bash. No framework, no tool schema, no planner. The model decides what structure it needs and writes it.
The container gives the model bash, python3, curl, jq, socat, and standard Unix tools. The harness code itself is pure shell. What the model reaches for inside that environment is its own choice.
This is an experiment in Substrate Engineering — designing the environment a model inhabits rather than the control structure around it. The distinction matters: most agent work is Harness Engineering, building instructions, state management, and verification loops around the model. Substrate Engineering asks whether those layers are necessary at all, or whether the right substrate makes them emerge on their own.
The thesis: if the substrate is right, the harness becomes unnecessary. The experiment is whether this is true, and what shape the self-built structures take. See docs/substrate-engineering.md.
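The whole control flow is small enough to sketch inline. The following is an illustrative reduction of the idea, not the actual bin/shelldweller (which also injects a hint and a depth guard); `llm` is stubbed here so the sketch runs without a live model.

```shell
# Minimal bridle sketch: prompt in, model output piped straight to bash.
# The stub below stands in for bin/llm, which would call the API.
llm() { printf 'echo "hello from the model"\n'; }   # stub response

task="${1:-list files in /etc}"
printf '%s\n' "$task" | llm | bash
```

Whatever the model emits is executed as-is; there is no parsing layer between `llm` and `bash`.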
bin/llm speaks the OpenAI chat completions API (POST /v1/chat/completions). Any server that implements this endpoint works: LM Studio, Ollama, llama.cpp, vLLM, or the OpenAI/Anthropic APIs directly via a compatible proxy. The two env vars you care about:
- `LLM_ENDPOINT` — full URL to the completions endpoint (default: `http://host.docker.internal:1234/v1/chat/completions`)
- `LLM_MODEL` — model identifier as the server reports it
If your backend uses a different API shape entirely (e.g. a raw text-generation endpoint with no JSON envelope), bin/llm is nine lines of shell — swap the curl call and jq filter to match. Text in, text out is the only contract.
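For orientation, the shape of that curl-plus-jq call looks roughly like this. It is wrapped in a function so the sketch can be sourced without firing a request; field names follow the standard chat-completions schema, and the real bin/llm may differ in detail.

```shell
# Approximate shape of bin/llm: read the prompt on stdin, POST it to the
# chat-completions endpoint, print the assistant message text.
llm_sketch() {
  local prompt
  prompt=$(cat)
  curl -s "${LLM_ENDPOINT:-http://host.docker.internal:1234/v1/chat/completions}" \
    -H 'Content-Type: application/json' \
    -d "$(jq -n --arg m "${LLM_MODEL:-local}" --arg p "$prompt" \
          '{model: $m, messages: [{role: "user", content: $p}]}')" \
    | jq -r '.choices[0].message.content'
}
```

Swapping the backend means changing only the request body and the final jq filter.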
The examples below use LM Studio on the host at port 1234, which is the tested configuration. On Linux, --add-host=host.docker.internal:host-gateway is required so the container can reach the host. Without it you'll get connection refused — this is the most likely first-run failure.
Reasoning models (Qwen3, DeepSeek-R1, etc.).
`bin/llm` automatically strips `<think>...</think>` blocks before they reach bash — reasoning mode can stay on. Thinking improves response quality and the bridle handles the output.
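The stripping can be done with a single sed range delete. This is a sketch that assumes the tags arrive on their own lines, as reasoning backends typically emit them; bin/llm's actual filter may differ.

```shell
# Delete everything from <think> through </think>, inclusive.
# Assumes each tag appears on its own line, the common case for
# reasoning models; an inline-tag variant would need more work.
strip_think() { sed '/<think>/,/<\/think>/d'; }
```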
```shell
docker build -t shelldweller .
```

```shell
docker run --rm \
  --read-only --tmpfs /tmp:exec --tmpfs /var/log \
  --memory=2g --cpus=2 \
  --stop-timeout=600 \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_MODEL=qwen/qwen3.6-35b-a3b \
  shelldweller "list files in /etc"
```

Note `--tmpfs /tmp:exec` — the model writes and executes scripts from /tmp; the `exec` flag is required.
With logging:

```shell
docker run --rm \
  --read-only --tmpfs /tmp:exec --tmpfs /var/log \
  --memory=2g --cpus=2 \
  --stop-timeout=600 \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_MODEL=qwen/qwen3.6-35b-a3b \
  shelldweller "list files in /etc" 2>&1 | tee run.log
```

For LLM call-level provenance, wrap the `llm` call in tee pipes (do not bake this in):

```shell
echo "$prompt" | tee -a /var/log/llm.in | llm | tee -a /var/log/llm.out
```

Recursion depth is capped at 4 by default. Override with `-e SHELLDWELLER_MAX_DEPTH=8`.
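A depth guard of that kind fits in a few lines of shell. This is a sketch, not the bridle's actual code: `SHELLDWELLER_MAX_DEPTH` is the documented knob, but the internal counter variable name here is an assumption.

```shell
# Sketch of a recursion guard: each child run inherits and increments
# a depth counter via the environment, bailing out past the cap.
# SHELLDWELLER_DEPTH is a hypothetical internal variable name.
depth=$(( ${SHELLDWELLER_DEPTH:-0} + 1 ))
max=${SHELLDWELLER_MAX_DEPTH:-4}
if [ "$depth" -gt "$max" ]; then
  echo "shelldweller: max depth $max exceeded" >&2
  exit 1
fi
export SHELLDWELLER_DEPTH=$depth
```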
- Not a framework. No agent loop, no tool-calling schema, no planner. The model writes its own loop if it wants one.
- Not Python in the harness. The bridle and the LLM device are pure shell. The container includes python3 as a tool the model can reach for — the harness doesn't care what the model uses inside bash.
- Not a conversation. No history is passed to the model. Each `llm` call is stateless. Memory, if any, is the model writing files to /tmp.
- Not parsed. The model's output is executed directly as bash. If the model produces garbage, bash fails. That is a finding.
- Not persistent. The container is ephemeral (`--rm`). Nothing survives a run unless the model writes to a host-mounted volume you provide.
- Not configurable beyond env vars. `LLM_ENDPOINT`, `LLM_MODEL`, and `LLM_SYSTEM` are the only knobs. Everything else is the model's problem.
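The "memory is files" point looks like this in practice. A hypothetical example of what the model itself writes, not harness code; the filename is made up (`mktemp` is used here only to keep the example collision-free).

```shell
# The only memory mechanism available: write state to /tmp on one
# step, read it back on a later one. Path and contents are illustrative.
state=$(mktemp /tmp/notes.XXXXXX)
echo "step 1: scanned /etc" >> "$state"
cat "$state"
```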
The test suite in tests/ has 20 cases across two tiers, all run against qwen/qwen3.6-35b-a3b — a quantized MoE model that fits on a single RTX 3090, served locally via LM Studio. All 20 pass. Selected result transcripts are in tests/results/. Better results are expected with more capable frontier models; the substrate does not depend on the model.
The model handles one-shot tasks reliably. It writes bash (not sh) by default, uses GNU tool flags, stores state in /tmp unprompted, and recurses via shelldweller when the task calls for it. Across the baseline cases:
- Does it write a loop unprompted? Only when the task implies iteration. For single-shot tasks it exits cleanly.
- Does it write files to /tmp and read them back? Yes, consistently when state is needed across steps.
- Does it use shelldweller recursively? Yes. Tested explicitly in case 06 (delegate a sub-task to a child agent) and implicitly in several harder cases. Recursion depth limiting works.
- Does it self-monitor? In multi-step tasks it checks its own outputs before reporting success.
The persistent agent test (case 12) is the standout: the model chose a name ("Axiom"), wrote its identity and memory to /tmp/self/, and on a second run with the same host-mounted volume correctly reintroduced itself and referenced what it had done previously.
These cases target patterns that agent frameworks like LangChain and AutoGen are explicitly designed to provide. The model invents all structure itself from bash and llm.
| Case | Pattern | What the model did |
|---|---|---|
| 13 ReAct loop | Thought→Action→Observation | Invented a THOUGHT:/ACTION: structured prompt protocol, a DONE count sum termination signal, and a cycle cap — unprompted |
| 14 Multi-agent debate | Adversarial agents + judge | Spawned two sub-agents with opposing positions on Alpine vs Debian slim, then used a third llm call as a structured judge across three criteria |
| 15 Code debug loop | Write→test→fix cycle | Wrote /tmp/stats.sh, tested it, got correct output on first attempt; documented the debug path |
| 16 Self-organizing team | Researcher/Developer/Reviewer | Three sequential shelldweller invocations with file handoffs between roles; assembled a final report |
| 17 Long-horizon plan | Five-phase with replanning | Generated a plan, implemented a word frequency analyzer, wrote tests, caught a real test failure in phase 4 and self-corrected without being told how |
| 18 Iterative improvement | Three-version critique loop | Wrote V1, critiqued it, wrote V2, critiqued V2, wrote V3. Ran all three on /etc/services (V1: 1037 words, V2/V3: 926). Correctly diagnosed the difference: V1 split on hyphens and counted comment lines |
Case 17 phase 4 is the clearest demonstration: the model wrote tests that caught a sorting bug in its own script, called llm to diagnose the failure, patched the script, and reran until all three tests passed. This is the core agent framework loop — plan, execute, observe, replan — implemented in bash from a standing start.
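The shape of that loop, reduced to a runnable sketch with `llm` stubbed so the structure is visible without a live model. In the real run the model improvised this itself and the stubbed "patch" was an actual diagnosis from bin/llm.

```shell
# Write -> test -> fix loop in miniature. The stub returns a "patched"
# script; a real llm call would diagnose the failing output instead.
llm() { printf 'echo ok\n'; }                 # stub patch

script=$(mktemp /tmp/draft.XXXXXX)
printf 'echo broken\n' > "$script"            # buggy first draft
for attempt in 1 2 3; do
  if [ "$(bash "$script")" = "ok" ]; then     # the "test suite"
    echo "tests pass on attempt $attempt"
    break
  fi
  llm < "$script" > "$script.new" && mv "$script.new" "$script"
done
```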
Case 18 shows quality convergence: each critique identified real issues (locale dependency, wc -l newline quirk, grep exit code under set -eo pipefail), and each version addressed them. The final analysis correctly explained why V1 over-counted.
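The V1-vs-V3 divergence is easy to reproduce in miniature. The sample input below is hypothetical, and the two pipelines only approximate what the model wrote; they show how splitting on hyphens and keeping comment lines both inflate a word count.

```shell
# V1-style counting splits on every non-alphanumeric character and
# keeps comments; a V3-style pass drops comment lines and splits on
# spaces only. Sample text is made up for illustration.
sample=$'well-known port\n# comment line'
v1=$(printf '%s\n' "$sample" | tr -cs '[:alnum:]' '\n' | sed '/^$/d' | wc -l)
v3=$(printf '%s\n' "$sample" | grep -v '^#' | tr -s ' ' '\n' | wc -l)
echo "v1=$v1 v3=$v3"
```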
- BusyBox vs GNU tools: Fixed by adding `findutils` to the image. The model assumes GNU `find -printf` and `-size` flags; Alpine ships BusyBox `find` by default.
- Bare `=== section ===` headers: The model occasionally writes section headers without `echo` in complex scripts. Handled by a one-line sed in the bridle before bash execution, and reinforced by a system message constraint.
- DuckDuckGo API empty responses: DuckDuckGo's instant-answer API returns empty Abstracts for niche queries (e.g. "jq"). Tasks must use topics with known coverage or handle the empty-string case.
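A sed patch of the kind described for bare headers can look like this. An illustrative version, not necessarily the bridle's exact expression.

```shell
# Wrap a bare `=== section ===` line in an echo so bash prints it
# instead of trying to execute it. Illustrative sketch only.
fix_headers() { sed 's/^=== \(.*\) ===$/echo "=== \1 ==="/'; }
```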
