fix: multiple eval improvements and bug fixes by JackHopkins · Pull Request #364 · JackHopkins/factorio-learning-environment

JackHopkins · 2026-04-05T22:59:25Z

Summary

Inspect AI Sandbox Evaluation

Add colocated Docker sandbox where Factorio server and FLE Python environment run in a single container managed by Inspect's native sandbox() API
Bridge architecture: persistent HTTP daemon (bridge_service.py) on Unix socket inside container, thin CLI client (bridge_client.py) invoked via sandbox().exec() from the host-side solver
Multi-stage Dockerfile: builds FLE wheel, installs on factoriotools/factorio:2.0.73 base with supervisord managing both Factorio and bridge processes
Sandbox solvers (controlled + unbounded) that mirror the integration solvers but communicate through the bridge
fle sandbox build CLI command with auto-build on first use of --sandbox
fle inspect-eval --sandbox --env-id <task> --model <model> to run evals without any external cluster
--scenario flag to configure which Factorio scenario the sandbox loads
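
The bridge pattern described above (an HTTP daemon listening on a Unix socket, plus a thin client that sends one request and exits) can be sketched with the standard library alone. This is a minimal illustration, not the contents of bridge_service.py or bridge_client.py: the /execute path, the JSON payload shape, and every class and function name below are assumptions made for the example.

```python
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler


class BridgeHandler(BaseHTTPRequestHandler):
    """Hypothetical daemon-side handler: echoes submitted code back as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"ok": True, "echo": payload.get("code", "")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Unix-socket peers have no (host, port) pair, so skip default logging.
        pass


class UnixHTTPServer(socketserver.UnixStreamServer):
    """Serves HTTP over an AF_UNIX socket instead of TCP."""


class UnixHTTPConnection(http.client.HTTPConnection):
    """Client-side connection that dials a Unix socket path."""

    def __init__(self, socket_path, timeout=10):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)


def make_socket_path():
    """Create a fresh socket path inside a temporary directory."""
    return os.path.join(tempfile.mkdtemp(), "bridge.sock")


def start_bridge(socket_path):
    """Start the daemon in a background thread and return the server."""
    server = UnixHTTPServer(socket_path, BridgeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_bridge(socket_path, code):
    """Thin client: one POST to the daemon, one JSON reply back."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(
            "POST",
            "/execute",
            body=json.dumps({"code": code}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

A persistent daemon keeps the expensive state (the game connection and the agent namespace) alive between steps, while each client call stays cheap enough to spawn once per step through sandbox().exec().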

Package Reorganisation

Move inspect_integration/ -> inspect/integration/
Add inspect/sandbox/ for sandbox-specific files
Add shared inspect/eval_set.py with DRY task factory functions (create_throughput_task, create_unbounded_production_task) used by both integration and sandbox eval sets
Update all imports across the codebase

Agent Namespace Protection

Freeze namespace names after tool loading so agent code cannot shadow FLE tools (move_to, place_entity, get_entities, etc.) with function or variable definitions
Raises clear NameError with guidance instead of silently breaking eval
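
One way to get this behaviour is a static check that runs before agent code is executed. The sketch below is an assumption about the approach, not the PR's implementation: it walks the AST and raises NameError whenever a protected name is bound. FROZEN_NAMES and check_no_shadowing are hypothetical names, and the real tool list is loaded at runtime rather than hard-coded.

```python
import ast

# Hypothetical subset of protected tool names, for illustration only.
FROZEN_NAMES = frozenset({"move_to", "place_entity", "get_entities"})


def check_no_shadowing(source, frozen=FROZEN_NAMES):
    """Raise NameError if the source binds any protected name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        name = None
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # def move_to(...): ... or class move_to: ...
            name = node.name
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # move_to = ..., for move_to in ..., with ... as move_to, etc.
            name = node.id
        if name in frozen:
            raise NameError(
                f"'{name}' is a protected FLE tool and cannot be redefined; "
                "choose a different name for your function or variable."
            )
```

Calling a tool, as in `result = move_to(target)`, only loads the name and passes the check; `def move_to(): ...` or `place_entity = None` is rejected before anything runs.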

Fix print() Capture Inside Agent-Defined Functions

Route print() to namespace.log in SerializableFunction.reconstruct() by injecting print=instance.log into the function's globals
Reorder globals construction: builtins first, then instance attributes, so namespace methods take precedence
Stop AST print->log rewriting inside FunctionDef bodies (caused infinite recursion when agents defined their own log helper that called print)
Fix function redefinition bug: read new function from eval_dict (locals) not function_namespace (globals) after exec()
Increase STDOUT capture limit from 64 to 512 lines; fix truncation to keep first lines instead of last
Add 12 unit tests covering prints in functions, nested functions, try/except, and the common agent safe() wrapper pattern
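
The globals ordering and the print injection described above can be illustrated with a small stand-in. This is a hedged sketch rather than the code in SerializableFunction.reconstruct(): the Namespace class and the reconstruct helper are hypothetical, and only the ordering (builtins first, then instance attributes, then print bound to the instance's log) follows the description.

```python
import builtins
import types


class Namespace:
    """Hypothetical stand-in for the agent namespace."""

    def __init__(self):
        self.logged = []

    def log(self, *args):
        self.logged.append(" ".join(str(a) for a in args))


def reconstruct(func, instance):
    """Rebuild func so that print() inside it is routed to instance.log."""
    # Builtins go in first so that later entries can override them.
    globals_dict = dict(vars(builtins))
    # Instance attributes come next and take precedence over builtins.
    globals_dict.update(
        {name: getattr(instance, name) for name in dir(instance) if not name.startswith("_")}
    )
    # Finally, rebind print to the namespace logger.
    globals_dict["print"] = instance.log
    return types.FunctionType(
        func.__code__, globals_dict, func.__name__, func.__defaults__, func.__closure__
    )


def agent_helper(value):
    print("value is", value)
    return value * 2
```

Because name lookup inside a function checks its globals before the builtins, the rebuilt function logs through the namespace without any rewriting of the function body.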

Prompt Improvements

Move policy-writing tips (observe-branch-act, fail fast, diff state, etc.) from task target to system prompt where they belong
Task target now contains only the concise goal description

Other Improvements

Add FLE_MODS_DIR / FLE_TOOLS_DIR env var overrides to path resolution in rcon.py
Handle numpy type serialization in bridge JSON responses (custom _NumpyEncoder)
Fix Observation.to_dict() round-trip issues (progress="None" string breaking from_dict())
Graceful fallback when vision rendering fails (no sprites installed)
Use 2-space indentation instead of tabs in tree observation formatter for consistent rendering
Auto-start Factorio cluster when no servers are reachable (integration mode)
Fix ContentImage crash in controlled solver: validate data: URI before passing to ContentImage
Pass entity dicts instead of repr strings to observation formatter for structured output
Consolidate redundant sprite-not-found warnings into single message
Add fastapi to main dependencies and new [inspect] optional group
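
The environment-variable override for path resolution might look roughly like the following. Only the variable names FLE_MODS_DIR and FLE_TOOLS_DIR come from the PR; resolve_dir and the default paths in the usage note are hypothetical, and the actual logic in rcon.py may differ.

```python
import os
from pathlib import Path


def resolve_dir(env_var, default):
    """Return the directory named by env_var if set and non-empty, else default."""
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()
    return Path(default)
```

For example, `resolve_dir("FLE_MODS_DIR", "fle/mods")` returns the packaged default unless the variable is set, which lets a container point the same code at a different mount location.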

- Auto-start Factorio cluster when no servers are reachable, with polling until ready (up to 2 min timeout). Needed count is derived from --limit / --max-connections.
- Fix ContentImage crash (IsADirectoryError) in controlled solver: validate previous_feedback_image is a data: URI before passing to ContentImage, matching the guard already in the unbounded solver.
- Pass entity dicts instead of repr strings to observation formatter so TreeObservationFormatter uses its structured dict path and all entity fields (position, direction, status, inventory, etc.) are shown instead of partial regex-parsed fragments.
- Consolidate 4 redundant sprite-not-found warnings into 1 message.
- Add fastapi to main dependencies and new [inspect] optional group.
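
The "poll until ready, up to 2 minutes" behaviour can be sketched as a generic helper. This is an illustrative assumption, not the PR's code: wait_until_ready, its parameters, and the injected clock and sleep functions are all hypothetical, and the real readiness check would probe the Factorio servers.

```python
import time


def wait_until_ready(is_ready, timeout=120.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call is_ready() until it returns True or the timeout elapses.

    Returns True if the check succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Using a monotonic clock keeps the deadline stable if the wall clock is adjusted, and passing the clock and sleep functions in makes the loop easy to exercise in tests without real waiting.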


In Factorio 2.0, automation requires automation-science-pack as a prerequisite, which in turn requires steam-power. Tests that start with all_technologies_researched=False can no longer research automation directly. Switch to SteamPower, which has no prerequisites.


- Remove redundant _load_research_state call with empty state in instance.reset(). The _reset() call already handles all_technologies_researched=False correctly; the extra _load_research_state was corrupting the research system by setting all tech.researched=false with an empty restore dict, which left force.add_research() unable to queue any technology.
- Simplify set_research server.lua: remove manual prerequisite and ingredient checks that were too strict for Factorio 2.0's tech tree. Keep basic existence/researched/enabled guards for clear error messages, and let add_research() handle the rest.
- Revert tests back to Technology.Automation (original assertions).

Sandbox evaluation system:
- Add colocated Docker sandbox where Factorio server and FLE Python environment run in a single container managed by Inspect's sandbox() API
- Bridge architecture: persistent HTTP daemon (bridge_service.py) on Unix socket inside container, thin CLI client (bridge_client.py) invoked via sandbox().exec() from the host-side solver
- Dockerfile (multi-stage): builds FLE wheel, installs on factoriotools base image with supervisord managing both Factorio and bridge processes
- Sandbox solvers (controlled + unbounded) that mirror the integration solvers but communicate through the bridge
- `fle sandbox build` CLI command with auto-build on first use
- `fle inspect-eval --sandbox` flag to run evals without external cluster

Package reorganisation:
- Move inspect_integration/ -> inspect/integration/
- Move inspect_sandbox/ -> inspect/sandbox/
- Add shared eval_set.py with DRY task factory functions used by both integration and sandbox eval sets
- Update all imports across the codebase

Agent namespace protection:
- Freeze namespace names after tool loading so agent code cannot shadow FLE tools (move_to, place_entity, etc.) with function/variable defs
- Raises clear NameError instead of silently breaking eval

Fix print() capture inside agent-defined functions:
- Route print() to namespace.log in SerializableFunction.reconstruct() by injecting print=instance.log into the function's globals
- Reorder globals: builtins first, then instance attrs, so namespace methods take precedence over builtins
- Stop AST print->log rewriting inside FunctionDef bodies (caused infinite recursion when agents defined their own 'log' helper)
- Fix function redefinition bug: read new function from eval_dict (locals) not function_namespace (globals) after exec()
- Increase STDOUT capture limit from 64 to 512 lines, fix truncation to keep first lines instead of last
- Add 12 unit tests covering prints in functions, nested functions, try/except, and the common agent safe() wrapper pattern

Other fixes:
- Add FLE_MODS_DIR / FLE_TOOLS_DIR env var overrides to path resolution
- Handle numpy type serialization in bridge JSON responses
- Fix Observation.to_dict() round-trip issues (progress="None" string)
- Graceful fallback when vision rendering fails (no sprites)
- Use 2-space indentation instead of tabs in tree observation formatter

The open_play_production task target now contains only the goal description. The policy-writing tips (observe-branch-act, fail fast, diff state, etc.) are moved to the unbounded system prompt where they belong: they are meta-instructions about how to write code, not what to build.

JackHopkins added 10 commits April 5, 2026 23:56

fix: add pytest-xdist to dependency-groups so CI can use -n flag

64438fe

uv sync --dev installs [dependency-groups] dev, not [project.optional-dependencies] dev. pytest-xdist was only in the latter, causing pytest -n 4 to fail in the factorio-test workflow.

style: fix ruff formatting and Python 3.10 f-string lint error

692d38a

Extract backslash-containing .replace() call out of f-string to a local variable, fixing ruff invalid-syntax error on Python 3.10. Also apply ruff formatting to run.py and solver.py.
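
The f-string issue is easy to reproduce: before Python 3.12, a backslash inside the expression part of an f-string is a syntax error. The sketch below shows the shape of the fix; format_error and its message format are hypothetical and are not taken from run.py or solver.py.

```python
def format_error(message: str) -> str:
    """Flatten a multi-line message into one line for display."""
    # On Python < 3.12 the following line is a SyntaxError, because the
    # expression inside the braces contains a backslash:
    #     return f"error: {message.replace('\n', ' ')}"
    # Hoisting the call into a local variable works on every version.
    flattened = message.replace("\n", " ")
    return f"error: {flattened}"
```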

style: fix black formatting across new and modified files

0a4328d

style: fix ruff lint errors (unused imports/variables) and formatting

0c4e5d7

style: remove f-prefix from string without placeholders

ea02402

JackHopkins merged commit 18d54b9 into main Apr 6, 2026
3 checks passed

JackHopkins merged 10 commits into main from fix/eval-improvements-0.4.3
