This repository contains fully local proof-of-concept implementations for:
Takahashi, K. (2026). Counterfactually Auditable Lifecycle Certification for Autonomous Agents. Zenodo. https://doi.org/10.5281/zenodo.19089134
The repository is intentionally narrow. It implements move-local lifecycle-control experiments on synthetic extraction tasks with exact paired replay and deterministic Python control logic.
It does not claim:
- universal portfolio theory
- open-world off-policy evaluation
- production readiness
- full theorem-complete reproduction of the paper
- long-horizon autonomous-agent guarantees
The repository now contains four base PoCs plus one bridge experiment:
- `v1`: the original minimal local demonstration
- `v2`: a stricter follow-up with regime-mixture testing, learned deterministic routing, bounded main-path certificates, monitoring sweeps, retirement decomposition, and an optional partial-support comparison
- `v3`: a stricter follow-up to `v2` with a true observational passive baseline, explicit observational vs replay separation, frozen-vs-replanned deployment comparison, drift-type monitoring sweeps, and a more natural partial-support construction
- `v4`: a deterministic theorem-mechanism follow-up focused on supported forecast transport, sign-correct lifecycle envelopes, net-adoption gating, channelwise ledger control, and measurable-space accounting invariance
- `v4_bridge`: a bridge experiment that reuses the `v3` post-drift environment and monitoring outputs, then adds `v4`-style forecast transport, audit overhead, bridge charge, and channelwise ledger control on the same candidate replanning moves
The extraction-task PoCs v1 to v3 use a single local model in the main path:
- `gemma3:4b` on the local Ollama runtime
- no paid API dependency
- no network-dependent evaluation loop
v4 is different. It is a fully deterministic local experiment with no model calls. It is intended to test the paper's distinctive lifecycle-accounting machinery more directly than the extraction-task PoCs can.
The repository is designed to test a restricted claim set. For v1 to v3, the main empirical claims are:
- Passive-only admission reasoning can be unreliable under regime mismatch.
- Exact paired replay can support auditable move-local admission.
- Admission and continued validity are distinct problems.
- Adaptive sentinel-style monitoring can detect controlled post-admission drift.
- Retirement can become favorable after drift.
- Certified stock and actually deployed subset are distinct.
- Budget and compatibility constraints can materially change deployment.
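As a purely illustrative sketch of the first two claims (hypothetical numbers; this is not the repository's code), a passive log-only comparison can carry the wrong sign under regime mismatch, while exact paired replay on the target regime recovers the per-sample effect:

```python
# Hypothetical illustration (not the repository's implementation): passive
# log-only admission vs exact paired replay. In the logs, the candidate unit
# was exercised only on a hard regime, so a naive with/without comparison is
# confounded by regime composition. Paired replay scores the same
# target-regime samples with and without the unit.
easy_regime = [0.80, 0.85, 0.78]   # base utilities on easy samples
hard_regime = [0.30, 0.25, 0.35]   # base utilities on hard samples
unit_effect = 0.10                 # true per-sample gain from the unit

# Passive view: logged "with unit" outcomes exist only for the hard regime.
logged_with_unit = [u + unit_effect for u in hard_regime]
logged_without_unit = easy_regime
passive_estimate = (sum(logged_with_unit) / len(logged_with_unit)
                    - sum(logged_without_unit) / len(logged_without_unit))

# Replay view: the same target-regime samples, replayed with and without the unit.
target_regime = easy_regime + hard_regime
paired_gains = [(u + unit_effect) - u for u in target_regime]
replay_estimate = sum(paired_gains) / len(paired_gains)

print(f"passive estimate       = {passive_estimate:+.3f}")   # negative, misleading
print(f"paired replay estimate = {replay_estimate:+.3f}")    # +0.100
```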
For v4, the structural claims are narrower:
- Supported forecast transport with explicit unsupported slack behaves differently from naive future extrapolation.
- The sign-correct managed-horizon lower envelope differs from naive symmetric survival weighting when the core block is negative.
- Surrogate-positive lifecycle evidence does not by itself imply positive net adoption once overhead and bridge terms are charged.
- Channelwise ledger gating can reject moves that a naive merged-confidence controller would accept.
- Fixed measurable-space accounting can remain invariant under measurable refinements in a controlled toy construction.
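The third claim reduces to simple ledger arithmetic. The numbers below are invented for illustration and are not taken from the v4 run:

```python
# Toy net-adoption gate (hypothetical numbers and names): a move can look
# good on surrogate lifecycle evidence and still fail once adoption costs
# are charged explicitly.
surrogate_lifecycle_gain = 0.12   # positive surrogate evidence for the move
audit_overhead = 0.05             # cost of keeping the unit auditable
bridge_charge = 0.09              # penalty for forecast/deployment mismatch

net_adoption = surrogate_lifecycle_gain - audit_overhead - bridge_charge
print(f"surrogate-positive:    {surrogate_lifecycle_gain > 0}")  # True
print(f"net-adoption-positive: {net_adoption > 0}")              # False (-0.02)
```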
In v2, v3, and v4, the goal is not to make the results look better. The goal is to make the tests more falsifiable and less shaped toward a positive outcome.
For v4_bridge, the goal is narrower still: expose where a v3-style post-drift replay replan changes once adoption-aware bridge accounting is added on top of the same candidates, budgets, and drift scenarios.
v1 is the minimal baseline PoC. It shows the basic move-local lifecycle picture on a small synthetic setup.
Main limitations of v1:
- passive failure is tied to split composition
- routing is too hardcoded
- retirement is too close to stock-cost-only
- monitoring is too clean
- the main summary path relies on a convenient normal approximation
v2 is a scientifically stricter follow-up. It introduces:
- a master synthetic pool plus explicit regime mixtures
- a calibration-based deterministic router
- conservative bounded lower bounds in the main move-certificate path
- monitoring sweeps over exploration floors and drift severities
- retirement decomposition into active routed utility and stock-cost effects
- an optional conservative partial-support comparison
v2 is expected to produce weaker, mixed, or negative results in some settings. That is intentional.
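One way to picture the bounded main-path certificate is a standard Hoeffding-style lower bound on the mean paired gain, assuming per-sample gains lie in a known interval. This is only a sketch under those assumptions, not necessarily the exact bound v2 computes, but it shows how a positive replay mean can coexist with a negative certified lower bound at small sample sizes:

```python
# Hypothetical illustration of a conservative bounded lower bound on a mean
# paired gain, assuming gains are known to lie in [lo, hi]. This is a
# standard Hoeffding-style bound, not necessarily the certificate formula
# used in the repository.
import math

def bounded_lower(gains, lo=-1.0, hi=1.0, delta=0.05):
    n = len(gains)
    mean = sum(gains) / n
    slack = (hi - lo) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean - slack

gains = [0.06, 0.11, -0.02, 0.09, 0.04, 0.07, 0.10, 0.01] * 2  # 16 paired gains
print(f"mean gain   = {sum(gains) / len(gains):.3f}")   # positive
print(f"lower bound = {bounded_lower(gains):.3f}")      # negative at this n
```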
v3 is a narrower but cleaner follow-up to v2. It adds:
- a true observational passive path where `A1` is absent from deployment logs
- a separate interventional replay path for `+A1` and `-A1`
- explicit schema-versus-semantic drift monitoring sweeps
- frozen pre-drift planning versus post-drift replanning
- a more natural partial-support setting from replay/tool availability
The current local v3 run is intentionally mixed:
- the observational passive estimate is negative while the target-regime replay mean is positive
- the main bounded admission certificate remains negative
- the main bounded retirement certificate also remains negative even though the mean move gain is positive
- semantic mild drift is materially harder to detect than schema drift
- `A2` becomes live after drift under replanning, but not in the frozen main pre-drift deployed subset
Those are not bugs to be hidden. They are the point of the stricter design.
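The schema-versus-semantic asymmetry can be pictured with a toy monitor (hypothetical failure rates, shift sizes, and thresholds; not the repository's sentinel): schema drift produces hard parse failures that an alarm catches almost immediately, while mild semantic drift only nudges the utility distribution and needs far more samples to flag:

```python
# Toy monitor sketch (hypothetical thresholds, not the v3 sentinel). Schema
# drift produces hard parse failures, so a small failure-count alarm fires
# quickly. Mild semantic drift only shifts mean utility slightly, so a
# mean-shift test needs far more samples before it is confident.
def samples_to_detect_failures(fail_rate: float, alarm_count: int = 3) -> float:
    # Expected samples until `alarm_count` hard failures are observed.
    return alarm_count / fail_rate

def samples_to_detect_mean_shift(shift: float, sigma: float = 0.2, z: float = 3.0) -> float:
    # Samples needed before a z-sigma test on the running mean can flag `shift`.
    return (z * sigma / shift) ** 2

print("schema drift  :", samples_to_detect_failures(fail_rate=0.9))   # ~3 samples
print("semantic mild :", samples_to_detect_mean_shift(shift=0.03))    # ~400 samples
```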
v4 is a different kind of follow-up. It does not try to extend the extraction benchmark. Instead, it isolates several mechanisms that are more specific to the paper:
- supported forecast transport with unsupported-mass penalties
- sign-correct managed-horizon lower envelopes
- explicit subtraction of adoption overhead and bridge mismatch
- channelwise lifecycle ledgers rather than naive heterogeneous score merging
- invariance of signed accounting under simple measurable refinements
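A toy version of the first mechanism, with invented weights rather than the v4 fragile profile: the forecast value is trusted only on the supported fraction of the changed mass, and the unsupported remainder is charged at a conservative worst-case value, so the supported lower bound degrades with unsupported mass even though naive extrapolation stays flat:

```python
# Toy supported forecast transport (hypothetical values, not the v4 profile).
# Only the supported fraction of the changed mass carries the forecast value;
# the unsupported fraction is charged at a conservative worst-case value.
def naive_extrapolation(forecast_value: float) -> float:
    return forecast_value  # ignores support entirely

def supported_lower(forecast_value: float, unsupported_mass: float,
                    worst_case: float = -0.5) -> float:
    supported_mass = 1.0 - unsupported_mass
    return supported_mass * forecast_value + unsupported_mass * worst_case

forecast_value = 0.24
for unsupported_mass in (0.0, 0.5, 1.0):
    print(unsupported_mass,
          round(naive_extrapolation(forecast_value), 2),                # stays at 0.24
          round(supported_lower(forecast_value, unsupported_mass), 2))  # falls with mass
```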
The current local v4 run is intentionally mechanistic rather than statistical:
- for the fragile profile, naive future extrapolation stays at `0.24` even when unsupported mass rises to `1.0`, while the exact future value falls to `0.05` and the supported lower falls to `-0.09`
- for negative core blocks, the sign-correct lower is materially more conservative than naive symmetric survival weighting
- the move `add_fragile` is surrogate-positive but net-adoption-negative once overhead and bridge charges are applied
- a naive merged-confidence rule accepts a bridge-invalid move with score `0.9245`, while the channelwise ledger rejects it
- in the controller rollout, the net-adoption policy chooses `A2` and realizes higher utility than the surrogate-only or naive-merged policies
These v4 results are not intended as a theorem-complete reproduction. They are controlled illustrations of why the paper separates forecast support, sign-correct horizon weighting, bridge accounting, and heterogeneous ledger channels.
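To make the sign-correct point concrete, here is a simplified reading (hypothetical block values and a simplified rule, not the exact v4 envelope): discounting a positive horizon block by its survival probability is conservative for a lower bound, but discounting a negative block the same way shrinks a potential loss, so a sign-correct lower envelope discounts gains by survival while charging losses in full:

```python
# Simplified sketch of a sign-correct lower envelope over horizon blocks
# (hypothetical values and a simplified rule, not the exact v4 formula).
# Each block has a value and a survival probability of still being realized.
blocks = [(+0.30, 0.6), (-0.40, 0.6), (+0.10, 0.3)]  # (value, survival prob)

# Naive symmetric survival weighting: shrinks losses as well as gains.
naive = sum(value * p for value, p in blocks)

# Sign-correct lower envelope: positive blocks are discounted by survival,
# negative blocks are charged in full (worst case: the loss is realized).
sign_correct_lower = sum(value * p if value >= 0 else value for value, p in blocks)

print(f"naive symmetric weighting = {naive:+.3f}")               # -0.030
print(f"sign-correct lower        = {sign_correct_lower:+.3f}")  # -0.190
```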
v4_bridge connects the empirical toy path from v3 to the mechanistic accounting path from v4.
It keeps:
- `v3` post-drift monitoring summaries
- `v3` exact paired replay on target post-drift regimes
- the `v3` frozen versus replanned deployment setting
- the same small stock and budget-feasible deployment space
It adds:
- a supported forecast transport term derived from replay support and unsupported changed mass
- an explicit audit-overhead term for keeping the fragile unit in stock or live deployment
- a bridge-valid / weak / invalid status derived from monitoring strength, drift type, and unsupported mass
- a naive merged-adoption baseline and a stricter channelwise ledger policy
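The gap between the naive merged baseline and the ledger policy comes down to how heterogeneous evidence channels are combined. The channel names, scores, weights, and thresholds below are hypothetical, not the repository's actual gate:

```python
# Toy channelwise ledger gate vs naive merged confidence (hypothetical
# channel names, scores, weights, and thresholds). Merging heterogeneous
# channels into one score can hide a disqualifying channel.
channels = {
    "replay_gain":     0.95,  # strong paired-replay evidence
    "monitoring":      0.92,  # strong post-drift monitoring evidence
    "bridge_validity": 0.10,  # bridge-invalid: forecast support is missing
}

# Naive merged confidence: a single weighted average over all channels.
weights = {"replay_gain": 0.5, "monitoring": 0.4, "bridge_validity": 0.1}
merged = sum(weights[name] * score for name, score in channels.items())
naive_accept = merged >= 0.8

# Channelwise ledger: every channel must clear its own threshold.
thresholds = {"replay_gain": 0.6, "monitoring": 0.6, "bridge_validity": 0.5}
ledger_accept = all(channels[name] >= thresholds[name] for name in channels)

print(f"merged score = {merged:.3f}, naive accept = {naive_accept}")  # 0.853, True
print(f"ledger accept = {ledger_accept}")                             # False
```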
The current v4_bridge run is mixed on purpose:
- in schema mild, replay-only, naive merged, and ledger net-adoption all retire `A1` and switch to `A2`
- in schema severe, replay-only and naive merged retire `A1`, but the ledger policy keeps the pre-drift deployment because net adoption remains negative
- in semantic mild with budget `4`, replay-only changes deployment very late, while naive merged and ledger both keep the current deployment
- in semantic severe, replay-only retires `A1` to `B0`, but naive merged and ledger do not move
This means the bridge layer is not uniformly improving anything. In some settings it agrees with replay-only replanning, and in others it blocks moves that replay mean alone would accept.
- `src/lifecycle_poc/`: core implementation
- `src/lifecycle_poc/experiments/`: v1 to v4 and v4_bridge experiment entrypoints
- `configs/default.yaml`: v1 config
- `configs/v2.yaml`: v2 config
- `configs/v3.yaml`: v3 config
- `configs/v4.yaml`: v4 config
- `tests/`: lightweight unit tests
- `scripts/check_repo_hygiene.py`: repository hygiene scan
- Python 3.11+
- Ollama
- local model `gemma3:4b`
Check the model locally:

```
ollama list
```

Set up a local environment and install the package:

```
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -e .[dev]
```

Run the full v1 PoC:

```
python -m lifecycle_poc.experiments.run_all
```

Run the full v2 PoC:

```
python -m lifecycle_poc.experiments.run_all_v2
```

Run the full v3 PoC:

```
python -m lifecycle_poc.experiments.run_all_v3
```

Run the full v4 PoC:

```
python -m lifecycle_poc.experiments.run_all_v4
```

Run the full v4 bridge experiment:

```
python -m lifecycle_poc.experiments.run_all_v4_bridge
```

Run v1 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_baseline
python -m lifecycle_poc.experiments.run_admission
python -m lifecycle_poc.experiments.run_monitoring
python -m lifecycle_poc.experiments.run_retirement
python -m lifecycle_poc.experiments.run_deployment
```

Run v2 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_regime_mismatch_v2
python -m lifecycle_poc.experiments.run_admission_v2
python -m lifecycle_poc.experiments.run_monitoring_sweep_v2
python -m lifecycle_poc.experiments.run_retirement_v2
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v2
python -m lifecycle_poc.experiments.run_partial_support_v2
```

Run v3 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_observational_v3
python -m lifecycle_poc.experiments.run_admission_v3
python -m lifecycle_poc.experiments.run_monitoring_sweep_v3
python -m lifecycle_poc.experiments.run_retirement_v3
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v3
python -m lifecycle_poc.experiments.run_deployment_frozen_vs_replanned_v3
python -m lifecycle_poc.experiments.run_partial_support_v3
```

Run v4 phase scripts:

```
python -m lifecycle_poc.experiments.run_forecast_transport_v4
python -m lifecycle_poc.experiments.run_sign_correct_envelope_v4
python -m lifecycle_poc.experiments.run_adoption_v4
python -m lifecycle_poc.experiments.run_ledger_controller_v4
python -m lifecycle_poc.experiments.run_accounting_refinement_v4
```

Run v4 bridge phase scripts:

```
python -m lifecycle_poc.experiments.run_bridge_replanning_v4
python -m lifecycle_poc.experiments.run_bridge_drift_compare_v4
python -m lifecycle_poc.experiments.run_bridge_budget_compare_v4
```

Run tests:

```
python -m pytest
```

Run the hygiene scan:

```
python scripts/check_repo_hygiene.py
```

v1 writes outputs under `outputs/`.
v2 writes separate outputs under `outputs/v2/`, including:
- `outputs/v2/reports/run_all_v2_summary.json`
- `outputs/v2/reports/passive_regime_mismatch_v2.json`
- `outputs/v2/reports/admission_v2_summary.json`
- `outputs/v2/reports/monitoring_sweep_v2.csv`
- `outputs/v2/reports/retirement_v2_summary.json`
- `outputs/v2/reports/deployment_budget_sweep_v2.csv`
- `outputs/v2/reports/partial_support_v2.json`
Generated v2 figures include:
- `outputs/v2/figures/passive_regime_mismatch_v2.png`
- `outputs/v2/figures/admission_v2_curve.png`
- `outputs/v2/figures/monitoring_tradeoff_v2.png`
- `outputs/v2/figures/retirement_v2_decomposition.png`
- `outputs/v2/figures/deployment_budget_sweep_v2.png`
- `outputs/v2/figures/partial_support_v2.png`
v3 writes separate outputs under `outputs/v3/`, including:
- `outputs/v3/reports/run_all_v3_summary.json`
- `outputs/v3/reports/passive_observational_v3_summary.json`
- `outputs/v3/reports/admission_v3_summary.json`
- `outputs/v3/reports/monitoring_sweep_v3.csv`
- `outputs/v3/reports/retirement_v3_summary.json`
- `outputs/v3/reports/deployment_budget_sweep_v3.csv`
- `outputs/v3/reports/deployment_frozen_vs_replanned_v3.json`
- `outputs/v3/reports/partial_support_v3.json`
Generated v3 figures include:
- `outputs/v3/figures/passive_observational_v3.png`
- `outputs/v3/figures/admission_v3_curve.png`
- `outputs/v3/figures/monitoring_tradeoff_v3.png`
- `outputs/v3/figures/retirement_v3_decomposition.png`
- `outputs/v3/figures/retirement_v3_decomposition_mild.png`
- `outputs/v3/figures/retirement_v3_decomposition_severe.png`
- `outputs/v3/figures/deployment_budget_sweep_v3.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3_mild.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3_severe.png`
- `outputs/v3/figures/partial_support_v3.png`
v4 writes separate outputs under `outputs/v4/`, including:
- `outputs/v4/reports/run_all_v4_summary.json`
- `outputs/v4/reports/forecast_transport_v4.json`
- `outputs/v4/reports/sign_correct_envelope_v4.json`
- `outputs/v4/reports/adoption_v4.json`
- `outputs/v4/reports/ledger_gate_v4.json`
- `outputs/v4/reports/controller_rollout_v4.json`
- `outputs/v4/reports/accounting_refinement_v4.json`
Generated v4 figures include:
- `outputs/v4/figures/forecast_transport_v4.png`
- `outputs/v4/figures/sign_correct_envelope_v4.png`
- `outputs/v4/figures/adoption_v4.png`
- `outputs/v4/figures/ledger_gate_v4.png`
- `outputs/v4/figures/controller_rollout_v4.png`
- `outputs/v4/figures/accounting_refinement_v4.png`
v4_bridge writes separate outputs under `outputs/v4_bridge/`, including:
- `outputs/v4_bridge/reports/run_all_v4_bridge_summary.json`
- `outputs/v4_bridge/reports/policy_comparison_v4_bridge.csv`
- `outputs/v4_bridge/reports/move_decomposition_v4_bridge.csv`
- `outputs/v4_bridge/reports/bridge_difference_examples_v4.csv`
- `outputs/v4_bridge/reports/drift_compare_v4_bridge.csv`
- `outputs/v4_bridge/reports/budget_compare_v4_bridge.csv`
Generated v4_bridge figures include:
- `outputs/v4_bridge/figures/policy_comparison_v4_bridge.png`
- `outputs/v4_bridge/figures/budget_compare_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_schema_mild_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_schema_severe_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_semantic_mild_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_semantic_severe_v4_bridge.png`
- dataset generation is seed-fixed
- prompts are fixed
- tool behavior is deterministic
- routing and planning are deterministic given calibration data
- outputs are cached by sample and unit
- the main experiment path is fully local
Bitwise reproducibility can still depend on local model-runtime behavior for v1 to v3 and for the replay-backed part of v4_bridge. v4 does not have that dependency because it is deterministic and model-free. To keep the PoCs auditable, the repository stores enough structured logs under outputs/, outputs/v2/, outputs/v3/, outputs/v4/, and outputs/v4_bridge/ to recompute reported statistics from saved artifacts.
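As a sketch of what recomputing from saved artifacts can look like (the field names below are assumptions about the report schema, not documented guarantees), a saved JSON report can be reloaded and its summary statistic recomputed without rerunning any experiment:

```python
# Hypothetical sketch of recomputing a reported statistic from saved
# artifacts. The report path is real, but the field names are illustrative;
# consult the actual report files listed above for the real schema.
import json
from pathlib import Path

report_path = Path("outputs/v2/reports/admission_v2_summary.json")

if report_path.exists():
    report = json.loads(report_path.read_text())
    # Assumed field: per-sample paired gains stored for auditability.
    gains = report.get("paired_gains", [])
    if gains:
        recomputed_mean = sum(gains) / len(gains)
        print(f"recomputed mean gain = {recomputed_mean:.4f}")
        print(f"reported mean gain   = {report.get('mean_gain')}")
else:
    print("run the v2 experiments first to generate the report")
```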
The main scientific discipline in this repository is explicit scope control.
- exact paired replay is the core evidential mechanism
- drift is controlled and synthetic
- the monitoring stack is a simplified adaptive sentinel-style monitor, not a theorem-complete implementation
- `v1` keeps a normal approximation in the main summary path
- `v2` moves the bounded conservative lower bound into the main report path and keeps the normal approximation only as supplemental output
- `v3` separates observational passive evidence from replay/interventional evidence and keeps bounded certificates as the main admission and retirement path
- `v4` focuses on structural mechanisms from the paper rather than end-to-end extraction performance, and it does so with deterministic toy constructions rather than learned estimators
- `v4_bridge` explicitly separates:
  - replay evidence: exact paired replay mean gain and bounded lifecycle lower on the target post-drift regime
  - monitoring evidence: detection rate, delay, replan cycle, and monitoring-strength summary from the `v3` sentinel sweep
  - mechanistic assumptions: forecast transport slack, bridge validity / charge, and audit overhead
This means v2 can legitimately produce:
- positive replay means with negative bounded lower bounds
- mixed evidence across regimes
- stronger retirement results than admission results
This means v3 can legitimately produce:
- negative observational passive estimates with positive replay means
- positive replay means with negative bounded lower bounds
- harder semantic-drift detection than schema-drift detection
- replanning gains that appear only after drift evidence is incorporated
This means v4 can legitimately produce:
- positive surrogate lifecycle terms with negative net-adoption terms
- conservative supported-forecast lowers that are much weaker than naive future extrapolation
- channelwise rejection of moves that score well under naive merged confidence
- mechanism-level illustrations that are informative without being statistical performance studies
This means v4_bridge can legitimately produce:
- replay-only replans that are accepted on mean replay gain while their bounded lifecycle lower remains negative
- ledger rejections that are driven partly by mechanistic bridge or overhead assumptions rather than replay alone
- settings where all policies agree, alongside settings where only replay-only or naive merged policies move
Those outcomes should not be interpreted as failures of the repository. They are part of the stricter test design.
- synthetic data only
- one model family only
- single-step extraction only
- no external mutable tool ecosystem
- no theorem-complete implementation of the full paper
- no claim of general causal identification beyond the controlled PoCs
- no claim of real-world operational safety or deployment readiness
- the current `v3` certificate sweeps use local-scale sample sizes `{8, 16, 24}` rather than very large replay batches
- the current `v3` replanning follow-up covers schema drift, not semantic replanning
- `A2` is live after post-drift replanning, but not in the frozen pre-drift main subset
- `v4` is hand-specified and deterministic, so it does not by itself provide statistical evidence about model behavior in the wild
- `v4` is not an end-to-end controller benchmark and should be read as a mechanism-isolation supplement to `v1` to `v3`
- `v4_bridge` still uses a small synthetic stock and a single deterministic bridge formula family
- in `v4_bridge`, the replay-only policy selects moves by replay mean gain, while the bounded lifecycle lower is still reported separately and can remain negative
- `v4_bridge` does not show a universal advantage for the ledger policy; in the current run it agrees with replay-only under schema mild, but blocks moves under harsher settings
- no secrets are required in the main experiment path
- only synthetic data are used
- model weights are not committed
- `outputs/` is ignored by default
- `scripts/check_repo_hygiene.py` scans for common path and secret leakage patterns
See SECURITY.md for more detail.
- `result_summary.md`: v1 summary
- `result_summary_v2.md`: v2 summary
- `result_summary_v3.md`: v3 summary
- `result_summary_v4.md`: v4 summary
- `result_summary_v4_bridge.md`: v4 bridge summary
If you use or extend this repository, cite the paper:
```bibtex
@misc{takahashi2026counterfactual,
  author    = {Takahashi, K.},
  title     = {Counterfactually Auditable Lifecycle Certification for Autonomous Agents},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19089134},
  url       = {https://doi.org/10.5281/zenodo.19089134}
}
```