fix: multiple eval improvements and bug fixes #364
Merged
JackHopkins merged 10 commits into main on Apr 6, 2026
- Auto-start Factorio cluster when no servers are reachable, with polling until ready (up to 2 min timeout). Needed count is derived from --limit / --max-connections.
- Fix ContentImage crash (IsADirectoryError) in controlled solver: validate previous_feedback_image is a data: URI before passing to ContentImage, matching the guard already in the unbounded solver.
- Pass entity dicts instead of repr strings to observation formatter so TreeObservationFormatter uses its structured dict path and all entity fields (position, direction, status, inventory, etc.) are shown instead of partial regex-parsed fragments.
- Consolidate 4 redundant sprite-not-found warnings into 1 message.
- Add fastapi to main dependencies and new [inspect] optional group.
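The auto-start behaviour in the first bullet amounts to a poll-until-ready loop with a deadline. A minimal sketch of that pattern follows, assuming a hypothetical `count_reachable` callable that reports how many servers currently respond. The function name, the 2-second poll interval, and the injectable `sleep`/`clock` parameters are illustrative and not taken from the FLE codebase.

```python
import time
from typing import Callable


def wait_for_servers(
    count_reachable: Callable[[], int],  # hypothetical probe, not an FLE API
    needed: int,
    timeout_s: float = 120.0,       # "up to 2 min" per the commit message
    poll_interval_s: float = 2.0,   # assumed interval, not from the source
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until at least `needed` servers are reachable or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if count_reachable() >= needed:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the wall clock is adjusted while the cluster is starting.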
uv sync --dev installs [dependency-groups] dev, not [project.optional-dependencies] dev. pytest-xdist was only in the latter, causing pytest -n 4 to fail in the factorio-test workflow.
In Factorio 2.0, automation requires automation-science-pack as a prerequisite, which in turn requires steam-power. Tests that start with all_technologies_researched=False can no longer research automation directly. Switch to SteamPower which has no prerequisites.
Extract backslash-containing .replace() call out of f-string to a local variable, fixing ruff invalid-syntax error on Python 3.10. Also apply ruff formatting to run.py and solver.py.
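For context, Python versions before 3.12 reject a backslash anywhere inside the expression part of an f-string, which is why the call had to be hoisted. A small sketch of the pattern, where `format_feedback` and its argument are hypothetical names rather than the code actually changed in this commit:

```python
def format_feedback(message: str) -> str:
    # On Python < 3.12 the following line is a SyntaxError, because the
    # expression part of an f-string may not contain a backslash:
    #     return f"Feedback: {message.replace('\n', ' ')}"
    # Hoisting the call into a local variable works on every version.
    flattened = message.replace("\n", " ")
    return f"Feedback: {flattened}"
```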
- Remove redundant _load_research_state call with empty state in instance.reset(). The _reset() call already handles all_technologies_researched=False correctly; the extra _load_research_state was corrupting the research system by setting all tech.researched=false with an empty restore dict, which left force.add_research() unable to queue any technology.
- Simplify set_research server.lua: remove manual prerequisite and ingredient checks that were too strict for Factorio 2.0's tech tree. Keep basic existence/researched/enabled guards for clear error messages, and let add_research() handle the rest.
- Revert tests back to Technology.Automation (original assertions).
Sandbox evaluation system:
- Add colocated Docker sandbox where Factorio server and FLE Python environment run in a single container managed by Inspect's sandbox() API
- Bridge architecture: persistent HTTP daemon (bridge_service.py) on Unix socket inside container, thin CLI client (bridge_client.py) invoked via sandbox().exec() from the host-side solver
- Dockerfile (multi-stage): builds FLE wheel, installs on factoriotools base image with supervisord managing both Factorio and bridge processes
- Sandbox solvers (controlled + unbounded) that mirror the integration solvers but communicate through the bridge
- `fle sandbox build` CLI command with auto-build on first use
- `fle inspect-eval --sandbox` flag to run evals without external cluster

Package reorganisation:
- Move inspect_integration/ -> inspect/integration/
- Move inspect_sandbox/ -> inspect/sandbox/
- Add shared eval_set.py with DRY task factory functions used by both integration and sandbox eval sets
- Update all imports across the codebase

Agent namespace protection:
- Freeze namespace names after tool loading so agent code cannot shadow FLE tools (move_to, place_entity, etc.) with function/variable defs
- Raises clear NameError instead of silently breaking eval

Fix print() capture inside agent-defined functions:
- Route print() to namespace.log in SerializableFunction.reconstruct() by injecting print=instance.log into the function's globals
- Reorder globals: builtins first, then instance attrs, so namespace methods take precedence over builtins
- Stop AST print->log rewriting inside FunctionDef bodies (caused infinite recursion when agents defined their own 'log' helper)
- Fix function redefinition bug: read new function from eval_dict (locals) not function_namespace (globals) after exec()
- Increase STDOUT capture limit from 64 to 512 lines, fix truncation to keep first lines instead of last
- Add 12 unit tests covering prints in functions, nested functions, try/except, and the common agent safe() wrapper pattern

Other fixes:
- Add FLE_MODS_DIR / FLE_TOOLS_DIR env var overrides to path resolution
- Handle numpy type serialization in bridge JSON responses
- Fix Observation.to_dict() round-trip issues (progress="None" string)
- Graceful fallback when vision rendering fails (no sprites)
- Use 2-space indentation instead of tabs in tree observation formatter
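To illustrate the agent namespace protection described in this commit, here is a minimal sketch of rejecting agent code that would shadow a tool name. It deliberately swaps in a static AST check rather than reproducing the commit's runtime namespace freeze, whose implementation is not shown here. `check_no_shadowing`, `PROTECTED_NAMES`, and the error wording are hypothetical.

```python
import ast

# Illustrative subset of tool names; the real list comes from tool loading.
PROTECTED_NAMES = frozenset({"move_to", "place_entity"})


def check_no_shadowing(source: str, protected: frozenset = PROTECTED_NAMES) -> None:
    """Raise NameError if `source` defines or assigns a protected name.

    Conservative: any binding of a protected name is rejected, including
    local variables inside agent-defined functions.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            name = node.name
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name in protected:
            raise NameError(
                f"'{name}' is an FLE tool and cannot be redefined; "
                "choose a different name."
            )
```

Merely reading a tool name, as in `x = move_to`, passes the check, while `def move_to(): ...` or `place_entity = 1` raises `NameError` before any agent code runs.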
The open_play_production task target now contains only the goal description. The policy-writing tips (observe-branch-act, fail fast, diff state, etc.) are moved to the unbounded system prompt where they belong — they are meta-instructions about how to write code, not what to build.
Summary
Inspect AI Sandbox Evaluation
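- Add colocated Docker sandbox where Factorio server and FLE Python environment run in a single container managed by Inspect's `sandbox()` API
- Bridge architecture: persistent HTTP daemon (`bridge_service.py`) on Unix socket inside container, thin CLI client (`bridge_client.py`) invoked via `sandbox().exec()` from the host-side solver
- Multi-stage Dockerfile on `factoriotools/factorio:2.0.73` base with supervisord managing both Factorio and bridge processes
- `fle sandbox build` CLI command with auto-build on first use of `--sandbox`
- `fle inspect-eval --sandbox --env-id <task> --model <model>` to run evals without any external cluster
- `--scenario` flag to configure which Factorio scenario the sandbox loads

Package Reorganisation
- Move `inspect_integration/` -> `inspect/integration/`
- `inspect/sandbox/` for sandbox-specific files
- Add shared `inspect/eval_set.py` with DRY task factory functions (`create_throughput_task`, `create_unbounded_production_task`) used by both integration and sandbox eval sets

Agent Namespace Protection
- Freeze namespace names after tool loading so agent code cannot shadow FLE tools (`move_to`, `place_entity`, `get_entities`, etc.) with function or variable definitions
- Raises clear `NameError` with guidance instead of silently breaking eval

Fix print() Capture Inside Agent-Defined Functions
- Route `print()` to `namespace.log` in `SerializableFunction.reconstruct()` by injecting `print=instance.log` into the function's globals
- Stop AST `print`->`log` rewriting inside `FunctionDef` bodies (caused infinite recursion when agents defined their own `log` helper that called `print`)
- Fix function redefinition bug: read new function from `eval_dict` (locals) not `function_namespace` (globals) after `exec()`
- Add 12 unit tests covering prints in functions, nested functions, try/except, and the common agent `safe()` wrapper pattern

Prompt Improvements
- Move policy-writing tips (observe-branch-act, fail fast, diff state, etc.) from the open_play_production task target to the unbounded system prompt

Other Improvements
- Add `FLE_MODS_DIR` / `FLE_TOOLS_DIR` env var overrides to path resolution in `rcon.py`
- Handle numpy type serialization in bridge JSON responses (`_NumpyEncoder`)
- Fix `Observation.to_dict()` round-trip issues (`progress="None"` string breaking `from_dict()`)
- Fix `ContentImage` crash in controlled solver: validate `data:` URI before passing to `ContentImage`
- Add `fastapi` to main dependencies and new `[inspect]` optional group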
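The print() capture fix described earlier relies on Python's name lookup order: a function resolves `print` in its globals before falling back to builtins, so injecting a `print` entry into the globals redirects output without rewriting the function body. A minimal sketch of that mechanism follows, where `reconstruct_function` and `log` are hypothetical stand-ins for `SerializableFunction.reconstruct()` and `instance.log`:

```python
from typing import Callable, List


def reconstruct_function(source: str, name: str, log: Callable[..., None]):
    """Exec `source` and return the function `name` with print() routed to `log`."""
    # Globals are searched before builtins, so this entry shadows the builtin
    # print for the function and for anything nested inside it.
    func_globals = {"print": log}
    exec(source, func_globals)
    return func_globals[name]


captured: List[str] = []


def log(*args) -> None:
    # Stand-in for the namespace logger: record the joined arguments.
    captured.append(" ".join(str(a) for a in args))


# Hypothetical agent code using the common safe() wrapper pattern.
agent_source = '''
def safe(fn):
    try:
        return fn()
    except Exception as exc:
        print("error:", exc)

def work():
    def inner():
        print("inside nested function")
    inner()
    raise ValueError("boom")
'''

safe = reconstruct_function(agent_source, "safe", log)
work = reconstruct_function(agent_source, "work", log)
safe(work)
# captured is now ["inside nested function", "error: boom"]
```

Because the redirect happens at lookup time, an agent that defines its own `log` helper calling `print` is unaffected, which sidesteps the recursion that source-level `print`->`log` rewriting caused.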