Tianshi-Xu/Life-Harness

Life-Harness

Adapting the interface, not the model, for deterministic LLM agents


News

  • 2026/05/24: Released the paper and codebase. The second version of the paper has also been submitted to arXiv, and the code release includes the evolution prompts used to build the harness.

[Figure: Life-Harness overview]

Life-Harness is the code release for "Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents." It targets a practical question: when a frozen LLM agent repeatedly fails in a deterministic environment, can we improve the runtime harness around the agent instead of retraining the model or modifying the environment?

The answer is yes. Life-Harness turns recurring failures into reusable runtime interventions across action realization, environment contracts, trajectory regulation, and procedural skills. The model remains frozen; the benchmark environment remains intact; only the harness interface adapts.

| Benchmarks | Model backbones | Settings improved | Avg. relative gain | Training-free |
|---|---|---|---|---|
| 7 | 18 | 116 / 126 | 88.5% | Yes |

Why Life-Harness

| What changes? | What stays fixed? | Why it matters |
|---|---|---|
| Runtime harness behavior | LLM weights | No finetuning or model-specific training pipeline |
| Prompted environment interface | Benchmark environment | Keeps deterministic evaluation comparable |

Results

Across 7 deterministic agent benchmarks and 18 model backbones, Life-Harness improves 116 / 126 model-environment settings, with an 88.5% average relative improvement reported in the paper.
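
The headline counts can be checked with a few lines of Python. This is a minimal sketch that assumes every one of the 7 benchmarks is paired with every one of the 18 backbones, which gives the 126 settings. The `relative_gain` helper uses the conventional definition (harness score minus baseline, divided by baseline); that definition and the sample scores are assumptions for illustration, and the paper's exact averaging procedure is not restated here.

```python
# Sanity-check the headline counts quoted above.
NUM_BENCHMARKS = 7
NUM_BACKBONES = 18
IMPROVED_SETTINGS = 116

# Assumes a full benchmark x backbone grid.
total_settings = NUM_BENCHMARKS * NUM_BACKBONES
improved_fraction = IMPROVED_SETTINGS / total_settings


def relative_gain(baseline: float, with_harness: float) -> float:
    """Conventional relative improvement; assumed, not taken from the paper."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_harness - baseline) / baseline


if __name__ == "__main__":
    print(f"{total_settings} settings, {improved_fraction:.1%} improved")
    # Hypothetical scores, for illustration only.
    print(f"relative gain: {relative_gain(0.20, 0.30):.1%}")
```

Under these assumptions, 116 of 126 settings is roughly 92% of the grid. That share is a different quantity from the 88.5% average relative improvement, which measures how much scores change rather than how many settings improve.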

[Figure: Life-Harness result heatmap]

Method

Life-Harness evolves a small set of runtime layers from observed failures, then reuses those layers during evaluation.

[Figure: Life-Harness method overview]

| Harness flag | Paper layer | Runtime role |
|---|---|---|
| h2 | Action Realization Layer | Helps convert model decisions into executable environment actions. |
| h3 | Environment Contract Layer | Makes task and environment constraints explicit at runtime. |
| h4 | Trajectory Regulation Layer | Regulates multi-step interaction traces to avoid repeated failure patterns. |
| h5 | Procedural Skill Layer | Reuses procedural knowledge distilled from recurring successful recoveries. |

When the harness is disabled, these layers are not applied.
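
To make the layer roles concrete, here is a toy sketch of how flag-gated layers could wrap a frozen agent. It is not the repository's implementation: the `HarnessSketch` class, its method names, the note strings, and the matching rules are all hypothetical. Only the flag names `h2` to `h5` and their roles come from the table above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class HarnessSketch:
    """Toy harness around a frozen agent. All names and rules here are hypothetical."""

    # Enabled layers, e.g. {"h2", "h3", "h4", "h5"}; an empty set means the harness is off.
    flags: Set[str] = field(default_factory=set)
    contract_note: str = "Contract: use only the listed actions."  # made-up h3 content
    skill_note: str = "Skill: look around before picking things up."  # made-up h5 content

    def build_prompt(self, observation: str, history: List[str]) -> str:
        """Assemble what the frozen model sees; the environment text itself is untouched."""
        parts = [observation]
        if "h3" in self.flags:  # Environment Contract Layer
            parts.append(self.contract_note)
        if "h5" in self.flags:  # Procedural Skill Layer
            parts.append(self.skill_note)
        # Trajectory Regulation Layer: flag an immediately repeated action.
        if "h4" in self.flags and len(history) >= 2 and history[-1] == history[-2]:
            parts.append("Note: the last action was repeated; try a different one.")
        return "\n".join(parts)

    def realize_action(self, raw_output: str, valid_actions: List[str]) -> str:
        """Action Realization Layer: map loose model output onto an executable action."""
        if "h2" not in self.flags:
            return raw_output
        cleaned = raw_output.strip().strip(".").lower()
        for action in valid_actions:
            if action.lower() == cleaned:
                return action
        return raw_output  # no match: pass the output through unchanged


if __name__ == "__main__":
    full = HarnessSketch(flags={"h2", "h3", "h4", "h5"})
    off = HarnessSketch()  # harness disabled: no layer is applied
    print(full.build_prompt("You are in a kitchen.", ["look", "look"]))
    print(repr(full.realize_action("  Open Fridge. ", ["open fridge", "look"])))
    print(repr(off.realize_action("  Open Fridge. ", ["open fridge", "look"])))
```

In this sketch the model and the environment are never modified; only the text passed between them changes, and an empty flag set reproduces the unharnessed behavior.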

Benchmarks

This repository keeps the two benchmark families in separate folders because their environments and dependencies are intentionally different.

| Suite | Environments | Start here |
|---|---|---|
| AgentBench-style harness | ALFWorld, DBBench, OS, WebShop | AgentBench/README.md |
| tau-bench-style harness | Airline, Retail, Telecom | TauBench/README.md |

Life-harness/
  AgentBench/      # Docker-based AgentBench-style tasks
  TauBench/        # uv-based tau-bench-style tasks
  assets/          # README figures

Quick Start

Clone the repository, then enter the benchmark suite you want to run:

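# Clone into the directory name used in this README
git clone https://github.com/Tianshi-Xu/Life-Harness.git Life-harness
cd Life-harness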

# tau-bench-style tasks: Airline, Retail, Telecom
cd TauBench

# AgentBench-style tasks: ALFWorld, DBBench, OS, WebShop
cd ../AgentBench

Each subfolder README contains its own environment setup, evaluation commands, and harness switches. API keys and provider URLs should be configured locally through environment variables or .env files; do not commit them.
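
As an illustration of keeping credentials out of the repository, the sketch below reads them from environment variables at runtime. The variable names `LLM_API_KEY` and `LLM_BASE_URL` and both helper functions are hypothetical; the names each suite actually expects are documented in its subfolder README.

```python
import os
from typing import Dict, Mapping, Optional


def load_provider_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read provider settings from the environment. Variable names are hypothetical."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY", "")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set; export it or add it to a local .env file")
    return {"api_key": api_key, "base_url": env.get("LLM_BASE_URL", "")}


def redact(secret: str) -> str:
    """Return a log-safe form of a secret so the full value is never printed."""
    return secret[:3] + "..." if len(secret) > 6 else "***"


if __name__ == "__main__":
    # A fake mapping stands in for os.environ so the example runs without real credentials.
    config = load_provider_config({"LLM_API_KEY": "sk-example-not-real"})
    print("api_key:", redact(config["api_key"]))
```

Failing fast on a missing key gives a clear error before any evaluation starts, and redacting the key in logs keeps it out of saved run output.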

Citation

If you use this repository, please cite the paper:

@article{xu2026adapting,
  title={Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents},
  author={Xu, Tianshi and Wen, Huifeng and Li, Meng},
  journal={arXiv preprint arXiv:2605.22166},
  year={2026}
}
