fix: multiple eval improvements and bug fixes #364

Merged
JackHopkins merged 10 commits into main from fix/eval-improvements-0.4.3 on Apr 6, 2026

@JackHopkins commented on Apr 5, 2026

Summary

Inspect AI Sandbox Evaluation

  • Add colocated Docker sandbox where Factorio server and FLE Python environment run in a single container managed by Inspect's native sandbox() API
  • Bridge architecture: persistent HTTP daemon (bridge_service.py) on Unix socket inside container, thin CLI client (bridge_client.py) invoked via sandbox().exec() from the host-side solver
  • Multi-stage Dockerfile: builds FLE wheel, installs on factoriotools/factorio:2.0.73 base with supervisord managing both Factorio and bridge processes
  • Sandbox solvers (controlled + unbounded) that mirror the integration solvers but communicate through the bridge
  • fle sandbox build CLI command with auto-build on first use of --sandbox
  • fle inspect-eval --sandbox --env-id <task> --model <model> to run evals without any external cluster
  • --scenario flag to configure which Factorio scenario the sandbox loads
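
The bridge pattern described above (a persistent HTTP daemon on a Unix socket, called by a thin client) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of bridge_service.py or bridge_client.py: the `/execute` route, the JSON payload shape, and the names `UnixHTTPConnection`, `EchoHandler`, and `call_bridge` are all hypothetical, and the handler simply echoes its input.

```python
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 10.0):
        super().__init__("localhost", timeout=timeout)
        self._socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._socket_path)
        self.sock = sock


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the daemon: echoes the posted JSON back."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"ok": True, "echo": payload}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # Unix sockets have no (host, port) client address to log


def call_bridge(socket_path: str, route: str, payload: dict) -> dict:
    """Thin client: one POST per invocation, JSON in and JSON out."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("POST", route, body=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def demo() -> dict:
    """Start the stand-in daemon on a temporary socket and call it once."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bridge.sock")
        server = socketserver.UnixStreamServer(path, EchoHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            return call_bridge(path, "/execute", {"code": "print(1)"})
        finally:
            server.shutdown()
            server.server_close()
```

A Unix socket keeps the daemon unreachable from outside the container, and a one-shot client fits an exec-style call because each invocation is a single short-lived process that talks to the long-lived daemon.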

Package Reorganisation

  • Move inspect_integration/ -> inspect/integration/
  • Add inspect/sandbox/ for sandbox-specific files
  • Add shared inspect/eval_set.py with DRY task factory functions (create_throughput_task, create_unbounded_production_task) used by both integration and sandbox eval sets
  • Update all imports across the codebase

Agent Namespace Protection

  • Freeze namespace names after tool loading so agent code cannot shadow FLE tools (move_to, place_entity, get_entities, etc.) with function or variable definitions
  • Raises clear NameError with guidance instead of silently breaking eval
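
One way to enforce this kind of protection is a static check before execution. The sketch below is an assumption about how such a guard could look, not the mechanism used in this PR: `find_shadowed_names` and `check_agent_code` are hypothetical names, and the check is deliberately conservative (it also flags a local variable that reuses a tool name, and it does not inspect function parameters or imports).

```python
import ast


def find_shadowed_names(source: str, frozen: frozenset) -> list:
    """Return the frozen names that the source tries to rebind, sorted."""
    shadowed = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            name = node.name  # def move_to(...) / class move_to
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id    # move_to = ...
        else:
            continue
        if name in frozen:
            shadowed.add(name)
    return sorted(shadowed)


def check_agent_code(source: str, frozen: frozenset) -> None:
    """Raise NameError with guidance if agent code would shadow a tool."""
    shadowed = find_shadowed_names(source, frozen)
    if shadowed:
        raise NameError(
            f"Cannot redefine protected tool name(s): {', '.join(shadowed)}. "
            "Choose a different name for your function or variable."
        )


# Hypothetical frozen set, using tool names mentioned in the PR description.
FROZEN = frozenset({"move_to", "place_entity", "get_entities"})
```

Calling a tool (`move_to(target)`) or reading it (`f = move_to`) passes the check; only rebinding the name is rejected.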

Fix print() Capture Inside Agent-Defined Functions

  • Route print() to namespace.log in SerializableFunction.reconstruct() by injecting print=instance.log into the function's globals
  • Reorder globals construction: builtins first, then instance attributes, so namespace methods take precedence
  • Stop AST print->log rewriting inside FunctionDef bodies (caused infinite recursion when agents defined their own log helper that called print)
  • Fix function redefinition bug: read new function from eval_dict (locals) not function_namespace (globals) after exec()
  • Increase STDOUT capture limit from 64 to 512 lines; fix truncation to keep first lines instead of last
  • Add 12 unit tests covering prints in functions, nested functions, try/except, and the common agent safe() wrapper pattern
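
The globals-injection idea can be illustrated with `types.FunctionType`, which rebuilds a function from its code object with a different globals dict. This is a minimal sketch, not the body of SerializableFunction.reconstruct(); `rebuild_with_log` and its parameters are hypothetical names chosen for illustration.

```python
import builtins
import types


def rebuild_with_log(func, namespace_attrs: dict, log):
    """Rebuild func so that print() inside it is routed to log."""
    globals_ = {}
    globals_.update(vars(builtins))    # builtins first...
    globals_.update(namespace_attrs)   # ...so namespace attributes override them
    globals_["print"] = log            # print() now resolves to the log callable
    return types.FunctionType(
        func.__code__,
        globals_,
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )
```

Because the redirection happens through name lookup at call time, no source rewriting is needed inside the function body, which avoids the recursion problem that arises when an agent's own `log` helper calls `print`.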

Prompt Improvements

  • Move policy-writing tips (observe-branch-act, fail fast, diff state, etc.) from task target to system prompt where they belong
  • Task target now contains only the concise goal description

Other Improvements

  • Add FLE_MODS_DIR / FLE_TOOLS_DIR env var overrides to path resolution in rcon.py
  • Handle numpy type serialization in bridge JSON responses (custom _NumpyEncoder)
  • Fix Observation.to_dict() round-trip issues (progress="None" string breaking from_dict())
  • Graceful fallback when vision rendering fails (no sprites installed)
  • Use 2-space indentation instead of tabs in tree observation formatter for consistent rendering
  • Auto-start Factorio cluster when no servers are reachable (integration mode)
  • Fix ContentImage crash in controlled solver: validate data: URI before passing to ContentImage
  • Pass entity dicts instead of repr strings to observation formatter for structured output
  • Consolidate redundant sprite-not-found warnings into single message
  • Add fastapi to main dependencies and new [inspect] optional group
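
The auto-start item relies on polling until enough servers are reachable. A hedged sketch of such a loop follows; the two-minute default matches the timeout mentioned in the commit message, while `wait_until_ready`, the `count_reachable` probe, and the two-second interval are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    count_reachable: Callable[[], int],
    needed: int,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least `needed` servers respond, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if count_reachable() >= needed:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.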

- Auto-start Factorio cluster when no servers are reachable, with
  polling until ready (up to 2 min timeout). Needed count is derived
  from --limit / --max-connections.

- Fix ContentImage crash (IsADirectoryError) in controlled solver:
  validate previous_feedback_image is a data: URI before passing to
  ContentImage, matching the guard already in the unbounded solver.

- Pass entity dicts instead of repr strings to observation formatter
  so TreeObservationFormatter uses its structured dict path and all
  entity fields (position, direction, status, inventory, etc.) are
  shown instead of partial regex-parsed fragments.

- Consolidate 4 redundant sprite-not-found warnings into 1 message.

- Add fastapi to main dependencies and new [inspect] optional group.

uv sync --dev installs [dependency-groups] dev, not
[project.optional-dependencies] dev. pytest-xdist was only in the
latter, causing pytest -n 4 to fail in the factorio-test workflow.

In Factorio 2.0, automation requires automation-science-pack as a
prerequisite, which in turn requires steam-power. Tests that start
with all_technologies_researched=False can no longer research
automation directly. Switch to SteamPower, which has no prerequisites.

Extract the backslash-containing .replace() call out of the f-string into a
local variable, fixing a ruff invalid-syntax error on Python 3.10.
Also apply ruff formatting to run.py and solver.py.

- Remove redundant _load_research_state call with empty state in
  instance.reset(). The _reset() call already handles
  all_technologies_researched=False correctly; the extra
  _load_research_state was corrupting the research system by setting
  all tech.researched=false with an empty restore dict, which left
  force.add_research() unable to queue any technology.

- Simplify set_research server.lua: remove manual prerequisite and
  ingredient checks that were too strict for Factorio 2.0's tech
  tree. Keep basic existence/researched/enabled guards for clear
  error messages, and let add_research() handle the rest.

- Revert tests back to Technology.Automation (original assertions).

Sandbox evaluation system:
- Add colocated Docker sandbox where Factorio server and FLE Python
  environment run in a single container managed by Inspect's sandbox() API
- Bridge architecture: persistent HTTP daemon (bridge_service.py) on Unix
  socket inside container, thin CLI client (bridge_client.py) invoked via
  sandbox().exec() from the host-side solver
- Dockerfile (multi-stage): builds FLE wheel, installs on factoriotools
  base image with supervisord managing both Factorio and bridge processes
- Sandbox solvers (controlled + unbounded) that mirror the integration
  solvers but communicate through the bridge
- `fle sandbox build` CLI command with auto-build on first use
- `fle inspect-eval --sandbox` flag to run evals without external cluster

Package reorganisation:
- Move inspect_integration/ -> inspect/integration/
- Move inspect_sandbox/ -> inspect/sandbox/
- Add shared eval_set.py with DRY task factory functions used by both
  integration and sandbox eval sets
- Update all imports across the codebase

Agent namespace protection:
- Freeze namespace names after tool loading so agent code cannot shadow
  FLE tools (move_to, place_entity, etc.) with function/variable defs
- Raises clear NameError instead of silently breaking eval

Fix print() capture inside agent-defined functions:
- Route print() to namespace.log in SerializableFunction.reconstruct()
  by injecting print=instance.log into the function's globals
- Reorder globals: builtins first, then instance attrs, so namespace
  methods take precedence over builtins
- Stop AST print->log rewriting inside FunctionDef bodies (caused
  infinite recursion when agents defined their own 'log' helper)
- Fix function redefinition bug: read new function from eval_dict
  (locals) not function_namespace (globals) after exec()
- Increase STDOUT capture limit from 64 to 512 lines, fix truncation
  to keep first lines instead of last
- Add 12 unit tests covering prints in functions, nested functions,
  try/except, and the common agent safe() wrapper pattern
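
The truncation change (keep the first lines rather than the last) can be sketched as follows. The 512-line limit comes from the commit message; the function name `truncate_stdout` and the trailing marker line are assumptions, not the project's actual output format.

```python
def truncate_stdout(text: str, max_lines: int = 512) -> str:
    """Keep the first max_lines lines and note how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    kept = lines[:max_lines]  # first lines, not lines[-max_lines:]
    kept.append(f"... [{len(lines) - max_lines} more lines truncated]")
    return "\n".join(kept)
```

Keeping the head of the output preserves the earliest prints, which are usually the ones that explain what a long-running loop was doing when it started.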

Other fixes:
- Add FLE_MODS_DIR / FLE_TOOLS_DIR env var overrides to path resolution
- Handle numpy type serialization in bridge JSON responses
- Fix Observation.to_dict() round-trip issues (progress="None" string)
- Graceful fallback when vision rendering fails (no sprites)
- Use 2-space indentation instead of tabs in tree observation formatter

The open_play_production task target now contains only the goal
description. The policy-writing tips (observe-branch-act, fail fast,
diff state, etc.) are moved to the unbounded system prompt where they
belong: they are meta-instructions about how to write code, not what
to build.

@JackHopkins merged commit 18d54b9 into main on Apr 6, 2026
3 checks passed