This repository contains fully local proof-of-concept implementations for:
Takahashi, K. (2026). Counterfactually Auditable Lifecycle Certification for Autonomous Agents. Zenodo. https://doi.org/10.5281/zenodo.19089134
The repository is intentionally narrow. It implements move-local lifecycle-control experiments on synthetic extraction tasks with exact paired replay and deterministic Python control logic.
It does not claim:
- universal portfolio theory
- open-world off-policy evaluation
- production readiness
- full theorem-complete reproduction of the paper
- long-horizon autonomous-agent guarantees
The repository now contains four base PoCs plus one bridge experiment:
- `v1`: the original minimal local demonstration
- `v2`: a stricter follow-up with regime-mixture testing, learned deterministic routing, bounded main-path certificates, monitoring sweeps, retirement decomposition, and an optional partial-support comparison
- `v3`: a stricter follow-up to `v2` with a true observational passive baseline, explicit observational vs replay separation, frozen-vs-replanned deployment comparison, drift-type monitoring sweeps, and a more natural partial-support construction
- `v4`: a deterministic theorem-mechanism follow-up focused on supported forecast transport, sign-correct lifecycle envelopes, net-adoption gating, channelwise ledger control, and measurable-space accounting invariance
- `v4_bridge`: a bridge experiment that reuses the `v3` post-drift environment and monitoring outputs, then adds `v4`-style forecast transport, audit overhead, bridge charge, and channelwise ledger control on the same candidate replanning moves
The extraction-task PoCs v1 to v3 use a single local model in the main path:
- `gemma3:4b` on the local Ollama runtime
- no paid API dependency
- no network-dependent evaluation loop
v4 is different. It is a fully deterministic local experiment with no model calls. It is intended to test the paper's distinctive lifecycle-accounting machinery more directly than the extraction-task PoCs can.
The repository is designed to test a restricted claim set. For v1 to v3, the main empirical claims are:
- Passive-only admission reasoning can be unreliable under regime mismatch.
- Exact paired replay can support auditable move-local admission.
- Admission and continued validity are distinct problems.
- Adaptive sentinel-style monitoring can detect controlled post-admission drift.
- Retirement can become favorable after drift.
- Certified stock and actually deployed subset are distinct.
- Budget and compatibility constraints can materially change deployment.
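As a purely illustrative sketch of the first two claims (hypothetical numbers; this is not the repository's code), a passive log-only comparison can carry the wrong sign under regime mismatch, while exact paired replay on the target regime recovers the per-sample effect:

```python
# Hypothetical illustration (not the repository's implementation): passive
# log-only admission vs exact paired replay. In the logs, the candidate unit
# was exercised only on a hard regime, so a naive with/without comparison is
# confounded by regime composition. Paired replay scores the same
# target-regime samples with and without the unit.
easy_regime = [0.80, 0.85, 0.78]   # base utilities on easy samples
hard_regime = [0.30, 0.25, 0.35]   # base utilities on hard samples
unit_effect = 0.10                 # true per-sample gain from the unit

# Passive view: logged "with unit" outcomes exist only for the hard regime.
logged_with_unit = [u + unit_effect for u in hard_regime]
logged_without_unit = easy_regime
passive_estimate = (sum(logged_with_unit) / len(logged_with_unit)
                    - sum(logged_without_unit) / len(logged_without_unit))

# Replay view: the same target-regime samples, replayed with and without the unit.
target_regime = easy_regime + hard_regime
paired_gains = [(u + unit_effect) - u for u in target_regime]
replay_estimate = sum(paired_gains) / len(paired_gains)

print(f"passive estimate       = {passive_estimate:+.3f}")   # negative, misleading
print(f"paired replay estimate = {replay_estimate:+.3f}")    # +0.100
```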
For v4, the structural claims are narrower:
- Supported forecast transport with explicit unsupported slack behaves differently from naive future extrapolation.
- The sign-correct managed-horizon lower envelope differs from naive symmetric survival weighting when the core block is negative.
- Surrogate-positive lifecycle evidence does not by itself imply positive net adoption once overhead and bridge terms are charged.
- Channelwise ledger gating can reject moves that a naive merged-confidence controller would accept.
- Fixed measurable-space accounting can remain invariant under measurable refinements in a controlled toy construction.
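The third claim reduces to simple ledger arithmetic. The numbers below are invented for illustration and are not taken from the v4 run:

```python
# Toy net-adoption gate (hypothetical numbers and names): a move can look
# good on surrogate lifecycle evidence and still fail once adoption costs
# are charged explicitly.
surrogate_lifecycle_gain = 0.12   # positive surrogate evidence for the move
audit_overhead = 0.05             # cost of keeping the unit auditable
bridge_charge = 0.09              # penalty for forecast/deployment mismatch

net_adoption = surrogate_lifecycle_gain - audit_overhead - bridge_charge
print(f"surrogate-positive:    {surrogate_lifecycle_gain > 0}")  # True
print(f"net-adoption-positive: {net_adoption > 0}")              # False (-0.02)
```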
In v2, v3, and v4, the goal is not to make the results look better. The goal is to make the tests more falsifiable and less shaped toward a positive outcome.
For v4_bridge, the goal is narrower still: expose where a v3-style post-drift replay replan changes once adoption-aware bridge accounting is added on top of the same candidates, budgets, and drift scenarios.
v1 is the minimal baseline PoC. It shows the basic move-local lifecycle picture on a small synthetic setup.
Main limitations of v1:
- passive failure is tied to split composition
- routing is too hardcoded
- retirement is too close to stock-cost-only
- monitoring is too clean
- the main summary path relies on a convenient normal approximation
v2 is a scientifically stricter follow-up. It introduces:
- a master synthetic pool plus explicit regime mixtures
- a calibration-based deterministic router
- conservative bounded lower bounds in the main move-certificate path
- monitoring sweeps over exploration floors and drift severities
- retirement decomposition into active routed utility and stock-cost effects
- an optional conservative partial-support comparison
v2 is expected to produce weaker, mixed, or negative results in some settings. That is intentional.
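One way to picture the bounded main-path certificate is a standard Hoeffding-style lower bound on the mean paired gain, assuming per-sample gains lie in a known interval. This is only a sketch under those assumptions, not necessarily the exact bound v2 computes, but it shows how a positive replay mean can coexist with a negative certified lower bound at small sample sizes:

```python
# Hypothetical illustration of a conservative bounded lower bound on a mean
# paired gain, assuming gains are known to lie in [lo, hi]. This is a
# standard Hoeffding-style bound, not necessarily the certificate formula
# used in the repository.
import math

def bounded_lower(gains, lo=-1.0, hi=1.0, delta=0.05):
    n = len(gains)
    mean = sum(gains) / n
    slack = (hi - lo) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean - slack

gains = [0.06, 0.11, -0.02, 0.09, 0.04, 0.07, 0.10, 0.01] * 2  # 16 paired gains
print(f"mean gain   = {sum(gains) / len(gains):.3f}")   # positive
print(f"lower bound = {bounded_lower(gains):.3f}")      # negative at this n
```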
v3 is a narrower but cleaner follow-up to v2. It adds:
- a true observational passive path where `A1` is absent from deployment logs
- a separate interventional replay path for `+A1` and `-A1`
- explicit schema-versus-semantic drift monitoring sweeps
- frozen pre-drift planning versus post-drift replanning
- a more natural partial-support setting from replay/tool availability
The current local v3 run is intentionally mixed:
- the observational passive estimate is negative while the target-regime replay mean is positive
- the main bounded admission certificate remains negative
- the main bounded retirement certificate also remains negative even though the mean move gain is positive
- semantic mild drift is materially harder to detect than schema drift
- `A2` becomes live after drift under replanning, but not in the frozen main pre-drift deployed subset
Those are not bugs to be hidden. They are the point of the stricter design.
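The schema-versus-semantic asymmetry can be pictured with a toy monitor (hypothetical failure rates, shift sizes, and thresholds; not the repository's sentinel): schema drift produces hard parse failures that an alarm catches almost immediately, while mild semantic drift only nudges the utility distribution and needs far more samples to flag:

```python
# Toy monitor sketch (hypothetical thresholds, not the v3 sentinel). Schema
# drift produces hard parse failures, so a small failure-count alarm fires
# quickly. Mild semantic drift only shifts mean utility slightly, so a
# mean-shift test needs far more samples before it is confident.
def samples_to_detect_failures(fail_rate: float, alarm_count: int = 3) -> float:
    # Expected samples until `alarm_count` hard failures are observed.
    return alarm_count / fail_rate

def samples_to_detect_mean_shift(shift: float, sigma: float = 0.2, z: float = 3.0) -> float:
    # Samples needed before a z-sigma test on the running mean can flag `shift`.
    return (z * sigma / shift) ** 2

print("schema drift  :", samples_to_detect_failures(fail_rate=0.9))   # ~3 samples
print("semantic mild :", samples_to_detect_mean_shift(shift=0.03))    # ~400 samples
```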
v4 is a different kind of follow-up. It does not try to extend the extraction benchmark. Instead, it isolates several mechanisms that are more specific to the paper:
- supported forecast transport with unsupported-mass penalties
- sign-correct managed-horizon lower envelopes
- explicit subtraction of adoption overhead and bridge mismatch
- channelwise lifecycle ledgers rather than naive heterogeneous score merging
- invariance of signed accounting under simple measurable refinements
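A toy version of the first mechanism, with invented weights rather than the v4 fragile profile: the forecast value is trusted only on the supported fraction of the changed mass, and the unsupported remainder is charged at a conservative worst-case value, so the supported lower bound degrades with unsupported mass even though naive extrapolation stays flat:

```python
# Toy supported forecast transport (hypothetical values, not the v4 profile).
# Only the supported fraction of the changed mass carries the forecast value;
# the unsupported fraction is charged at a conservative worst-case value.
def naive_extrapolation(forecast_value: float) -> float:
    return forecast_value  # ignores support entirely

def supported_lower(forecast_value: float, unsupported_mass: float,
                    worst_case: float = -0.5) -> float:
    supported_mass = 1.0 - unsupported_mass
    return supported_mass * forecast_value + unsupported_mass * worst_case

forecast_value = 0.24
for unsupported_mass in (0.0, 0.5, 1.0):
    print(unsupported_mass,
          round(naive_extrapolation(forecast_value), 2),                # stays at 0.24
          round(supported_lower(forecast_value, unsupported_mass), 2))  # falls with mass
```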
The current local v4 run is intentionally mechanistic rather than statistical:
- for the fragile profile, naive future extrapolation stays at `0.24` even when unsupported mass rises to `1.0`, while the exact future value falls to `0.05` and the supported lower falls to `-0.09`
- for negative core blocks, the sign-correct lower is materially more conservative than naive symmetric survival weighting
- the move `add_fragile` is surrogate-positive but net-adoption-negative once overhead and bridge charges are applied
- a naive merged-confidence rule accepts a bridge-invalid move with score `0.9245`, while the channelwise ledger rejects it
- in the controller rollout, the net-adoption policy chooses `A2` and realizes higher utility than the surrogate-only or naive-merged policies
These v4 results are not intended as a theorem-complete reproduction. They are controlled illustrations of why the paper separates forecast support, sign-correct horizon weighting, bridge accounting, and heterogeneous ledger channels.
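To make the sign-correct point concrete, here is a simplified reading (hypothetical block values and a simplified rule, not the exact v4 envelope): discounting a positive horizon block by its survival probability is conservative for a lower bound, but discounting a negative block the same way shrinks a potential loss, so a sign-correct lower envelope discounts gains by survival while charging losses in full:

```python
# Simplified sketch of a sign-correct lower envelope over horizon blocks
# (hypothetical values and a simplified rule, not the exact v4 formula).
# Each block has a value and a survival probability of still being realized.
blocks = [(+0.30, 0.6), (-0.40, 0.6), (+0.10, 0.3)]  # (value, survival prob)

# Naive symmetric survival weighting: shrinks losses as well as gains.
naive = sum(value * p for value, p in blocks)

# Sign-correct lower envelope: positive blocks are discounted by survival,
# negative blocks are charged in full (worst case: the loss is realized).
sign_correct_lower = sum(value * p if value >= 0 else value for value, p in blocks)

print(f"naive symmetric weighting = {naive:+.3f}")               # -0.030
print(f"sign-correct lower        = {sign_correct_lower:+.3f}")  # -0.190
```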
v4_bridge connects the empirical toy path from v3 to the mechanistic accounting path from v4.
It keeps:
- `v3` post-drift monitoring summaries
- `v3` exact paired replay on target post-drift regimes
- the `v3` frozen versus replanned deployment setting
- the same small stock and budget-feasible deployment space
It adds:
- a supported forecast transport term derived from replay support and unsupported changed mass
- an explicit audit-overhead term for keeping the fragile unit in stock or live deployment
- a bridge-valid / weak / invalid status derived from monitoring strength, drift type, and unsupported mass
- a naive merged-adoption baseline and a stricter channelwise ledger policy
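The gap between the naive merged baseline and the ledger policy comes down to how heterogeneous evidence channels are combined. The channel names, scores, weights, and thresholds below are hypothetical, not the repository's actual gate:

```python
# Toy channelwise ledger gate vs naive merged confidence (hypothetical
# channel names, scores, weights, and thresholds). Merging heterogeneous
# channels into one score can hide a disqualifying channel.
channels = {
    "replay_gain":     0.95,  # strong paired-replay evidence
    "monitoring":      0.92,  # strong post-drift monitoring evidence
    "bridge_validity": 0.10,  # bridge-invalid: forecast support is missing
}

# Naive merged confidence: a single weighted average over all channels.
weights = {"replay_gain": 0.5, "monitoring": 0.4, "bridge_validity": 0.1}
merged = sum(weights[name] * score for name, score in channels.items())
naive_accept = merged >= 0.8

# Channelwise ledger: every channel must clear its own threshold.
thresholds = {"replay_gain": 0.6, "monitoring": 0.6, "bridge_validity": 0.5}
ledger_accept = all(channels[name] >= thresholds[name] for name in channels)

print(f"merged score = {merged:.3f}, naive accept = {naive_accept}")  # 0.853, True
print(f"ledger accept = {ledger_accept}")                             # False
```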
The current v4_bridge run is mixed on purpose:
- in schema mild, replay-only, naive merged, and ledger net-adoption all retire `A1` and switch to `A2`
- in schema severe, replay-only and naive merged retire `A1`, but the ledger policy keeps the pre-drift deployment because net adoption remains negative
- in semantic mild with budget `4`, replay-only changes deployment very late, while naive merged and ledger both keep the current deployment
- in semantic severe, replay-only retires `A1` to `B0`, but naive merged and ledger do not move
This means the bridge layer is not uniformly improving anything. In some settings it agrees with replay-only replanning, and in others it blocks moves that replay mean alone would accept.
- `src/lifecycle_poc/`: core implementation
- `src/lifecycle_poc/experiments/`: v1 to v4 and v4_bridge experiment entrypoints
- `configs/default.yaml`: v1 config
- `configs/v2.yaml`: v2 config
- `configs/v3.yaml`: v3 config
- `configs/v4.yaml`: v4 config
- `tests/`: lightweight unit tests
- `scripts/check_repo_hygiene.py`: repository hygiene scan
- Python 3.11+
- Ollama
- local model `gemma3:4b`
Check the model locally:

```
ollama list
```

Set up a local environment and install the package:

```
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -e .[dev]
```

Run the full v1 PoC:

```
python -m lifecycle_poc.experiments.run_all
```

Run the full v2 PoC:

```
python -m lifecycle_poc.experiments.run_all_v2
```

Run the full v3 PoC:

```
python -m lifecycle_poc.experiments.run_all_v3
```

Run the full v4 PoC:

```
python -m lifecycle_poc.experiments.run_all_v4
```

Run the full v4 bridge experiment:

```
python -m lifecycle_poc.experiments.run_all_v4_bridge
```

Run v1 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_baseline
python -m lifecycle_poc.experiments.run_admission
python -m lifecycle_poc.experiments.run_monitoring
python -m lifecycle_poc.experiments.run_retirement
python -m lifecycle_poc.experiments.run_deployment
```

Run v2 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_regime_mismatch_v2
python -m lifecycle_poc.experiments.run_admission_v2
python -m lifecycle_poc.experiments.run_monitoring_sweep_v2
python -m lifecycle_poc.experiments.run_retirement_v2
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v2
python -m lifecycle_poc.experiments.run_partial_support_v2
```

Run v3 phase scripts:

```
python -m lifecycle_poc.experiments.run_passive_observational_v3
python -m lifecycle_poc.experiments.run_admission_v3
python -m lifecycle_poc.experiments.run_monitoring_sweep_v3
python -m lifecycle_poc.experiments.run_retirement_v3
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v3
python -m lifecycle_poc.experiments.run_deployment_frozen_vs_replanned_v3
python -m lifecycle_poc.experiments.run_partial_support_v3
```

Run v4 phase scripts:

```
python -m lifecycle_poc.experiments.run_forecast_transport_v4
python -m lifecycle_poc.experiments.run_sign_correct_envelope_v4
python -m lifecycle_poc.experiments.run_adoption_v4
python -m lifecycle_poc.experiments.run_ledger_controller_v4
python -m lifecycle_poc.experiments.run_accounting_refinement_v4
```

Run v4 bridge phase scripts:

```
python -m lifecycle_poc.experiments.run_bridge_replanning_v4
python -m lifecycle_poc.experiments.run_bridge_drift_compare_v4
python -m lifecycle_poc.experiments.run_bridge_budget_compare_v4
```

Run tests:

```
python -m pytest
```

Run the hygiene scan:

```
python scripts/check_repo_hygiene.py
```

v1 writes outputs under `outputs/`.
v2 writes separate outputs under `outputs/v2/`, including:
- `outputs/v2/reports/run_all_v2_summary.json`
- `outputs/v2/reports/passive_regime_mismatch_v2.json`
- `outputs/v2/reports/admission_v2_summary.json`
- `outputs/v2/reports/monitoring_sweep_v2.csv`
- `outputs/v2/reports/retirement_v2_summary.json`
- `outputs/v2/reports/deployment_budget_sweep_v2.csv`
- `outputs/v2/reports/partial_support_v2.json`
Generated v2 figures include:
- `outputs/v2/figures/passive_regime_mismatch_v2.png`
- `outputs/v2/figures/admission_v2_curve.png`
- `outputs/v2/figures/monitoring_tradeoff_v2.png`
- `outputs/v2/figures/retirement_v2_decomposition.png`
- `outputs/v2/figures/deployment_budget_sweep_v2.png`
- `outputs/v2/figures/partial_support_v2.png`
v3 writes separate outputs under `outputs/v3/`, including:
- `outputs/v3/reports/run_all_v3_summary.json`
- `outputs/v3/reports/passive_observational_v3_summary.json`
- `outputs/v3/reports/admission_v3_summary.json`
- `outputs/v3/reports/monitoring_sweep_v3.csv`
- `outputs/v3/reports/retirement_v3_summary.json`
- `outputs/v3/reports/deployment_budget_sweep_v3.csv`
- `outputs/v3/reports/deployment_frozen_vs_replanned_v3.json`
- `outputs/v3/reports/partial_support_v3.json`
Generated v3 figures include:
- `outputs/v3/figures/passive_observational_v3.png`
- `outputs/v3/figures/admission_v3_curve.png`
- `outputs/v3/figures/monitoring_tradeoff_v3.png`
- `outputs/v3/figures/retirement_v3_decomposition.png`
- `outputs/v3/figures/retirement_v3_decomposition_mild.png`
- `outputs/v3/figures/retirement_v3_decomposition_severe.png`
- `outputs/v3/figures/deployment_budget_sweep_v3.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3_mild.png`
- `outputs/v3/figures/deployment_frozen_vs_replanned_v3_severe.png`
- `outputs/v3/figures/partial_support_v3.png`
v4 writes separate outputs under `outputs/v4/`, including:
- `outputs/v4/reports/run_all_v4_summary.json`
- `outputs/v4/reports/forecast_transport_v4.json`
- `outputs/v4/reports/sign_correct_envelope_v4.json`
- `outputs/v4/reports/adoption_v4.json`
- `outputs/v4/reports/ledger_gate_v4.json`
- `outputs/v4/reports/controller_rollout_v4.json`
- `outputs/v4/reports/accounting_refinement_v4.json`
Generated v4 figures include:
- `outputs/v4/figures/forecast_transport_v4.png`
- `outputs/v4/figures/sign_correct_envelope_v4.png`
- `outputs/v4/figures/adoption_v4.png`
- `outputs/v4/figures/ledger_gate_v4.png`
- `outputs/v4/figures/controller_rollout_v4.png`
- `outputs/v4/figures/accounting_refinement_v4.png`
v4_bridge writes separate outputs under `outputs/v4_bridge/`, including:
- `outputs/v4_bridge/reports/run_all_v4_bridge_summary.json`
- `outputs/v4_bridge/reports/policy_comparison_v4_bridge.csv`
- `outputs/v4_bridge/reports/move_decomposition_v4_bridge.csv`
- `outputs/v4_bridge/reports/bridge_difference_examples_v4.csv`
- `outputs/v4_bridge/reports/drift_compare_v4_bridge.csv`
- `outputs/v4_bridge/reports/budget_compare_v4_bridge.csv`
Generated v4_bridge figures include:
- `outputs/v4_bridge/figures/policy_comparison_v4_bridge.png`
- `outputs/v4_bridge/figures/budget_compare_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_schema_mild_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_schema_severe_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_semantic_mild_v4_bridge.png`
- `outputs/v4_bridge/figures/rollout_semantic_severe_v4_bridge.png`
- dataset generation is seed-fixed
- prompts are fixed
- tool behavior is deterministic
- routing and planning are deterministic given calibration data
- outputs are cached by sample and unit
- the main experiment path is fully local
Bitwise reproducibility can still depend on local model-runtime behavior for v1 to v3 and for the replay-backed part of v4_bridge. v4 does not have that dependency because it is deterministic and model-free. To keep the PoCs auditable, the repository stores enough structured logs under outputs/, outputs/v2/, outputs/v3/, outputs/v4/, and outputs/v4_bridge/ to recompute reported statistics from saved artifacts.
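As a sketch of what recomputing from saved artifacts can look like (the field names below are assumptions about the report schema, not documented guarantees), a saved JSON report can be reloaded and its summary statistic recomputed without rerunning any experiment:

```python
# Hypothetical sketch of recomputing a reported statistic from saved
# artifacts. The report path is real, but the field names are illustrative;
# consult the actual report files listed above for the real schema.
import json
from pathlib import Path

report_path = Path("outputs/v2/reports/admission_v2_summary.json")

if report_path.exists():
    report = json.loads(report_path.read_text())
    # Assumed field: per-sample paired gains stored for auditability.
    gains = report.get("paired_gains", [])
    if gains:
        recomputed_mean = sum(gains) / len(gains)
        print(f"recomputed mean gain = {recomputed_mean:.4f}")
        print(f"reported mean gain   = {report.get('mean_gain')}")
else:
    print("run the v2 experiments first to generate the report")
```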
The main scientific discipline in this repository is explicit scope control.
- exact paired replay is the core evidential mechanism
- drift is controlled and synthetic
- the monitoring stack is a simplified adaptive sentinel-style monitor, not a theorem-complete implementation
- `v1` keeps a normal approximation in the main summary path
- `v2` moves the bounded conservative lower bound into the main report path and keeps the normal approximation only as supplemental output
- `v3` separates observational passive evidence from replay/interventional evidence and keeps bounded certificates as the main admission and retirement path
- `v4` focuses on structural mechanisms from the paper rather than end-to-end extraction performance, and it does so with deterministic toy constructions rather than learned estimators
- `v4_bridge` explicitly separates:
  - replay evidence: exact paired replay mean gain and bounded lifecycle lower on the target post-drift regime
  - monitoring evidence: detection rate, delay, replan cycle, and monitoring-strength summary from the `v3` sentinel sweep
  - mechanistic assumptions: forecast transport slack, bridge validity / charge, and audit overhead
This means v2 can legitimately produce:
- positive replay means with negative bounded lower bounds
- mixed evidence across regimes
- stronger retirement results than admission results
This means v3 can legitimately produce:
- negative observational passive estimates with positive replay means
- positive replay means with negative bounded lower bounds
- harder semantic-drift detection than schema-drift detection
- replanning gains that appear only after drift evidence is incorporated
This means v4 can legitimately produce:
- positive surrogate lifecycle terms with negative net-adoption terms
- conservative supported-forecast lowers that are much weaker than naive future extrapolation
- channelwise rejection of moves that score well under naive merged confidence
- mechanism-level illustrations that are informative without being statistical performance studies
This means v4_bridge can legitimately produce:
- replay-only replans that are accepted on mean replay gain while their bounded lifecycle lower remains negative
- ledger rejections that are driven partly by mechanistic bridge or overhead assumptions rather than replay alone
- settings where all policies agree, alongside settings where only replay-only or naive merged policies move
Those outcomes should not be interpreted as failures of the repository. They are part of the stricter test design.
- synthetic data only
- one model family only
- single-step extraction only
- no external mutable tool ecosystem
- no theorem-complete implementation of the full paper
- no claim of general causal identification beyond the controlled PoCs
- no claim of real-world operational safety or deployment readiness
- the current `v3` certificate sweeps use local-scale sample sizes `{8, 16, 24}` rather than very large replay batches
- the current `v3` replanning follow-up covers schema drift, not semantic replanning
- `A2` is live after post-drift replanning, but not in the frozen pre-drift main subset
- `v4` is hand-specified and deterministic, so it does not by itself provide statistical evidence about model behavior in the wild
- `v4` is not an end-to-end controller benchmark and should be read as a mechanism-isolation supplement to `v1` to `v3`
- `v4_bridge` still uses a small synthetic stock and a single deterministic bridge formula family
- in `v4_bridge`, the replay-only policy selects moves by replay mean gain, while the bounded lifecycle lower is still reported separately and can remain negative
- `v4_bridge` does not show a universal advantage for the ledger policy; in the current run it agrees with replay-only under schema mild, but blocks moves under harsher settings
- no secrets are required in the main experiment path
- only synthetic data are used
- model weights are not committed
- `outputs/` is ignored by default
- `scripts/check_repo_hygiene.py` scans for common path and secret leakage patterns
See SECURITY.md for more detail.
- `result_summary.md`: v1 summary
- `result_summary_v2.md`: v2 summary
- `result_summary_v3.md`: v3 summary
- `result_summary_v4.md`: v4 summary
- `result_summary_v4_bridge.md`: v4 bridge summary
If you use or extend this repository, cite the paper:
```bibtex
@misc{takahashi2026counterfactual,
  author    = {Takahashi, K.},
  title     = {Counterfactually Auditable Lifecycle Certification for Autonomous Agents},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19089134},
  url       = {https://doi.org/10.5281/zenodo.19089134}
}
```