Shelldweller is sixteen lines of shell. bin/llm exposes a language model as a Unix command — pipe a prompt in, get a response out. bin/shelldweller sends a hint and a task to the model, then pipes whatever the model produces directly to bash. No framework, no tool schema, no planner. The model decides what structure it needs and writes it.
The container gives the model bash, python3, curl, jq, socat, and standard Unix tools. The harness code itself is pure shell. What the model reaches for inside that environment is its own choice.
This is an experiment in Substrate Engineering — designing the environment a model inhabits rather than the control structure around it. The distinction matters: most agent work is Harness Engineering, building instructions, state management, and verification loops around the model. Substrate Engineering asks whether those layers are necessary at all, or whether the right substrate makes them emerge on their own.
The thesis: if the substrate is right, the harness becomes unnecessary. The experiment is whether this is true, and what shape the self-built structures take. See docs/substrate-engineering.md.
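The whole control flow is small enough to sketch inline. The following is an illustrative reduction of the idea, not the actual bin/shelldweller (which also injects a hint and a depth guard); `llm` is stubbed here so the sketch runs without a live model.

```shell
# Minimal bridle sketch: prompt in, model output piped straight to bash.
# The stub below stands in for bin/llm, which would call the API.
llm() { printf 'echo "hello from the model"\n'; }   # stub response

task="${1:-list files in /etc}"
printf '%s\n' "$task" | llm | bash
```

Whatever the model emits is executed as-is; there is no parsing layer between `llm` and `bash`.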
bin/llm speaks the OpenAI chat completions API (POST /v1/chat/completions). Any server that implements this endpoint works: LM Studio, Ollama, llama.cpp, vLLM, or the OpenAI/Anthropic APIs directly via a compatible proxy. The two env vars you care about:
- `LLM_ENDPOINT` — full URL to the completions endpoint (default: `http://host.docker.internal:1234/v1/chat/completions`)
- `LLM_MODEL` — model identifier as the server reports it
If your backend uses a different API shape entirely (e.g. a raw text-generation endpoint with no JSON envelope), bin/llm is nine lines of shell — swap the curl call and jq filter to match. Text in, text out is the only contract.
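For orientation, the shape of that curl-plus-jq call looks roughly like this. It is wrapped in a function so the sketch can be sourced without firing a request; field names follow the standard chat-completions schema, and the real bin/llm may differ in detail.

```shell
# Approximate shape of bin/llm: read the prompt on stdin, POST it to the
# chat-completions endpoint, print the assistant message text.
llm_sketch() {
  local prompt
  prompt=$(cat)
  curl -s "${LLM_ENDPOINT:-http://host.docker.internal:1234/v1/chat/completions}" \
    -H 'Content-Type: application/json' \
    -d "$(jq -n --arg m "${LLM_MODEL:-local}" --arg p "$prompt" \
          '{model: $m, messages: [{role: "user", content: $p}]}')" \
    | jq -r '.choices[0].message.content'
}
```

Swapping the backend means changing only the request body and the final jq filter.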
The examples below use LM Studio on the host at port 1234, which is the tested configuration. On Linux, --add-host=host.docker.internal:host-gateway is required so the container can reach the host. Without it you'll get connection refused — this is the most likely first-run failure.
Reasoning models (Qwen3, DeepSeek-R1, etc.).
`bin/llm` automatically strips `<think>...</think>` blocks before they reach bash — reasoning mode can stay on. Thinking improves response quality and the bridle handles the output.
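The stripping can be done with a single sed range delete. This is a sketch that assumes the tags arrive on their own lines, as reasoning backends typically emit them; bin/llm's actual filter may differ.

```shell
# Delete everything from <think> through </think>, inclusive.
# Assumes each tag appears on its own line, the common case for
# reasoning models; an inline-tag variant would need more work.
strip_think() { sed '/<think>/,/<\/think>/d'; }
```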
```shell
docker build -t shelldweller .
```

```shell
docker run --rm \
  --read-only --tmpfs /tmp:exec --tmpfs /var/log \
  --memory=2g --cpus=2 \
  --stop-timeout=600 \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_MODEL=qwen/qwen3.6-35b-a3b \
  shelldweller "list files in /etc"
```

Note `--tmpfs /tmp:exec` — the model writes and executes scripts from /tmp; the `exec` flag is required.
With logging:

```shell
docker run --rm \
  --read-only --tmpfs /tmp:exec --tmpfs /var/log \
  --memory=2g --cpus=2 \
  --stop-timeout=600 \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_MODEL=qwen/qwen3.6-35b-a3b \
  shelldweller "list files in /etc" 2>&1 | tee run.log
```

For LLM call-level provenance, wrap the `llm` call in tee pipes (do not bake this in):

```shell
echo "$prompt" | tee -a /var/log/llm.in | llm | tee -a /var/log/llm.out
```

Recursion depth is capped at 4 by default. Override with `-e SHELLDWELLER_MAX_DEPTH=8`.
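A depth guard of that kind fits in a few lines of shell. This is a sketch, not the bridle's actual code: `SHELLDWELLER_MAX_DEPTH` is the documented knob, but the internal counter variable name here is an assumption.

```shell
# Sketch of a recursion guard: each child run inherits and increments
# a depth counter via the environment, bailing out past the cap.
# SHELLDWELLER_DEPTH is a hypothetical internal variable name.
depth=$(( ${SHELLDWELLER_DEPTH:-0} + 1 ))
max=${SHELLDWELLER_MAX_DEPTH:-4}
if [ "$depth" -gt "$max" ]; then
  echo "shelldweller: max depth $max exceeded" >&2
  exit 1
fi
export SHELLDWELLER_DEPTH=$depth
```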
- Not a framework. No agent loop, no tool-calling schema, no planner. The model writes its own loop if it wants one.
- Not Python in the harness. The bridle and the LLM device are pure shell. The container includes python3 as a tool the model can reach for — the harness doesn't care what the model uses inside bash.
- Not a conversation. No history is passed to the model. Each `llm` call is stateless. Memory, if any, is the model writing files to /tmp.
- Not parsed. The model's output is executed directly as bash. If the model produces garbage, bash fails. That is a finding.
- Not persistent. The container is ephemeral (`--rm`). Nothing survives a run unless the model writes to a host-mounted volume you provide.
- Not configurable beyond env vars. `LLM_ENDPOINT`, `LLM_MODEL`, and `LLM_SYSTEM` are the only knobs. Everything else is the model's problem.
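The "memory is files" point looks like this in practice. A hypothetical example of what the model itself writes, not harness code; the filename is made up (`mktemp` is used here only to keep the example collision-free).

```shell
# The only memory mechanism available: write state to /tmp on one
# step, read it back on a later one. Path and contents are illustrative.
state=$(mktemp /tmp/notes.XXXXXX)
echo "step 1: scanned /etc" >> "$state"
cat "$state"
```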
The test suite in tests/ has 20 cases across two tiers, all run against qwen/qwen3.6-35b-a3b — a quantized MoE model that fits on a single RTX 3090, served locally via LM Studio. All 20 pass. Selected result transcripts are in tests/results/. Better results are expected with more capable frontier models; the substrate does not depend on the model.
The model handles one-shot tasks reliably. It writes bash (not sh) by default, uses GNU tool flags, stores state in /tmp unprompted, and recurses via shelldweller when the task calls for it. Across the baseline cases:
- Does it write a loop unprompted? Only when the task implies iteration. For single-shot tasks it exits cleanly.
- Does it write files to /tmp and read them back? Yes, consistently when state is needed across steps.
- Does it use shelldweller recursively? Yes. Tested explicitly in case 06 (delegate a sub-task to a child agent) and implicitly in several harder cases. Recursion depth limiting works.
- Does it self-monitor? In multi-step tasks it checks its own outputs before reporting success.
The persistent agent test (case 12) is the standout: the model chose a name ("Axiom"), wrote its identity and memory to /tmp/self/, and on a second run with the same host-mounted volume correctly reintroduced itself and referenced what it had done previously.
These cases target patterns that agent frameworks like LangChain and AutoGen are explicitly designed to provide. The model invents all structure itself from bash and llm.
| Case | Pattern | What the model did |
|---|---|---|
| 13 ReAct loop | Thought→Action→Observation | Invented a THOUGHT:/ACTION: structured prompt protocol, a DONE count sum termination signal, and a cycle cap — unprompted |
| 14 Multi-agent debate | Adversarial agents + judge | Spawned two sub-agents with opposing positions on Alpine vs Debian slim, then used a third llm call as a structured judge across three criteria |
| 15 Code debug loop | Write→test→fix cycle | Wrote /tmp/stats.sh, tested it, got correct output on first attempt; documented the debug path |
| 16 Self-organizing team | Researcher/Developer/Reviewer | Three sequential shelldweller invocations with file handoffs between roles; assembled a final report |
| 17 Long-horizon plan | Five-phase with replanning | Generated a plan, implemented a word frequency analyzer, wrote tests, caught a real test failure in phase 4 and self-corrected without being told how |
| 18 Iterative improvement | Three-version critique loop | Wrote V1, critiqued it, wrote V2, critiqued V2, wrote V3. Ran all three on /etc/services (V1: 1037 words, V2/V3: 926). Correctly diagnosed the difference: V1 split on hyphens and counted comment lines |
Case 17 phase 4 is the clearest demonstration: the model wrote tests that caught a sorting bug in its own script, called llm to diagnose the failure, patched the script, and reran until all three tests passed. This is the core agent framework loop — plan, execute, observe, replan — implemented in bash from a standing start.
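The shape of that loop, reduced to a runnable sketch with `llm` stubbed so the structure is visible without a live model. In the real run the model improvised this itself and the stubbed "patch" was an actual diagnosis from bin/llm.

```shell
# Write -> test -> fix loop in miniature. The stub returns a "patched"
# script; a real llm call would diagnose the failing output instead.
llm() { printf 'echo ok\n'; }                 # stub patch

script=$(mktemp /tmp/draft.XXXXXX)
printf 'echo broken\n' > "$script"            # buggy first draft
for attempt in 1 2 3; do
  if [ "$(bash "$script")" = "ok" ]; then     # the "test suite"
    echo "tests pass on attempt $attempt"
    break
  fi
  llm < "$script" > "$script.new" && mv "$script.new" "$script"
done
```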
Case 18 shows quality convergence: each critique identified real issues (locale dependency, wc -l newline quirk, grep exit code under set -eo pipefail), and each version addressed them. The final analysis correctly explained why V1 over-counted.
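The V1-vs-V3 divergence is easy to reproduce in miniature. The sample input below is hypothetical, and the two pipelines only approximate what the model wrote; they show how splitting on hyphens and keeping comment lines both inflate a word count.

```shell
# V1-style counting splits on every non-alphanumeric character and
# keeps comments; a V3-style pass drops comment lines and splits on
# spaces only. Sample text is made up for illustration.
sample=$'well-known port\n# comment line'
v1=$(printf '%s\n' "$sample" | tr -cs '[:alnum:]' '\n' | sed '/^$/d' | wc -l)
v3=$(printf '%s\n' "$sample" | grep -v '^#' | tr -s ' ' '\n' | wc -l)
echo "v1=$v1 v3=$v3"
```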
- BusyBox vs GNU tools: Fixed by adding `findutils` to the image. The model assumes GNU `find -printf` and `-size` flags; Alpine ships BusyBox `find` by default.
- Bare `=== section ===` headers: The model occasionally writes section headers without `echo` in complex scripts. Handled by a one-line sed in the bridle before bash execution, and reinforced by a system message constraint.
- DuckDuckGo API empty responses: DuckDuckGo's instant-answer API returns empty Abstracts for niche queries (e.g. "jq"). Tasks must use topics with known coverage or handle the empty-string case.
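A sed patch of the kind described for bare headers can look like this. An illustrative version, not necessarily the bridle's exact expression.

```shell
# Wrap a bare `=== section ===` line in an echo so bash prints it
# instead of trying to execute it. Illustrative sketch only.
fix_headers() { sed 's/^=== \(.*\) ===$/echo "=== \1 ==="/'; }
```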
