kadubon/agent-lifecycle-certification-poc
Counterfactually Auditable Lifecycle Certification PoC

This repository contains fully local proof-of-concept implementations for:

Takahashi, K. (2026). Counterfactually Auditable Lifecycle Certification for Autonomous Agents. Zenodo. https://doi.org/10.5281/zenodo.19089134

Scope

The repository is intentionally narrow. It implements move-local lifecycle-control experiments on synthetic extraction tasks with exact paired replay and deterministic Python control logic.

It does not claim:

  • universal portfolio theory
  • open-world off-policy evaluation
  • production readiness
  • full theorem-complete reproduction of the paper
  • long-horizon autonomous-agent guarantees
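As a hedged sketch of the exact paired replay named in the scope above (the function names and toy utility are illustrative, not the repository's API), the mechanism scores the same samples with and without a candidate move under identical per-sample randomness, so shared noise cancels in each paired difference:

```python
import random

def paired_replay_gain(samples, utility, move, seed=0):
    # Exact paired replay: score each sample with and without the
    # candidate move under identical per-sample randomness, then
    # average the paired per-sample differences.
    gains = []
    for i, x in enumerate(samples):
        noise = random.Random(seed + i).gauss(0.0, 0.1)  # shared per-pair noise
        with_move = utility(move(x)) + noise
        without_move = utility(x) + noise
        gains.append(with_move - without_move)
    return sum(gains) / len(gains)

# Toy example: the move adds 1.0 to each sample and utility is linear,
# so every paired difference is 0.5 and the shared noise cancels.
print(paired_replay_gain([1.0, 2.0, 3.0],
                         utility=lambda x: 0.5 * x,
                         move=lambda x: x + 1.0))  # ≈ 0.5, noise-free
```

Because both arms of each pair see the same noise, the variance of the estimate comes only from genuine per-sample treatment heterogeneity, which is what makes the admission evidence auditable.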

What Is In The Repository

The repository now contains four base PoCs plus one bridge experiment:

  • v1: the original minimal local demonstration
  • v2: a stricter follow-up with regime-mixture testing, learned deterministic routing, bounded main-path certificates, monitoring sweeps, retirement decomposition, and an optional partial-support comparison
  • v3: a stricter follow-up to v2 with a true observational passive baseline, explicit observational vs replay separation, frozen-vs-replanned deployment comparison, drift-type monitoring sweeps, and a more natural partial-support construction
  • v4: a deterministic theorem-mechanism follow-up focused on supported forecast transport, sign-correct lifecycle envelopes, net-adoption gating, channelwise ledger control, and measurable-space accounting invariance
  • v4_bridge: a bridge experiment that reuses the v3 post-drift environment and monitoring outputs, then adds v4-style forecast transport, audit overhead, bridge charge, and channelwise ledger control on the same candidate replanning moves

The extraction-task PoCs v1 to v3 use a single local model in the main path:

  • gemma3:4b
  • local Ollama runtime
  • no paid API dependency
  • no network-dependent evaluation loop

v4 is different. It is a fully deterministic local experiment with no model calls. It is intended to test the paper's distinctive lifecycle-accounting machinery more directly than the extraction-task PoCs can.

Core Claim Set

The repository is designed to test a restricted claim set. For v1 to v3, the main empirical claims are:

  1. Passive-only admission reasoning can be unreliable under regime mismatch.
  2. Exact paired replay can support auditable move-local admission.
  3. Admission and continued validity are distinct problems.
  4. Adaptive sentinel-style monitoring can detect controlled post-admission drift.
  5. Retirement can become favorable after drift.
  6. The certified stock and the actually deployed subset are distinct.
  7. Budget and compatibility constraints can materially change deployment.

For v4, the structural claims are narrower:

  1. Supported forecast transport with explicit unsupported slack behaves differently from naive future extrapolation.
  2. The sign-correct managed-horizon lower envelope differs from naive symmetric survival weighting when the core block is negative.
  3. Surrogate-positive lifecycle evidence does not by itself imply positive net adoption once overhead and bridge terms are charged.
  4. Channelwise ledger gating can reject moves that a naive merged-confidence controller would accept.
  5. Fixed measurable-space accounting can remain invariant under measurable refinements in a controlled toy construction.
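Claim 5's refinement invariance can be illustrated in miniature (the cell names and masses are assumptions for the sketch, not the repository's construction): splitting a cell's signed mass across subcells leaves every aggregate total unchanged.

```python
def total_charge(ledger):
    # Signed accounting: sum the signed mass over measurable cells.
    return sum(ledger.values())

# Coarse measurable space with one positive and one negative cell.
coarse = {"A": 0.4, "B": -0.7}

# Refine cell "A" into subcells whose signed masses sum to the original;
# the aggregate total is unchanged by the measurable refinement.
fine = {"A.1": 0.25, "A.2": 0.15, "B": -0.7}

print(total_charge(coarse))  # ≈ -0.3
print(total_charge(fine))    # the same total, up to float rounding
```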

In v2, v3, and v4, the goal is not to make the results look better. The goal is to make the tests more falsifiable and less shaped toward a positive outcome.

For v4_bridge, the goal is narrower still: expose where a v3-style post-drift replay replan changes once adoption-aware bridge accounting is added on top of the same candidates, budgets, and drift scenarios.

v1 Summary

v1 is the minimal baseline PoC. It shows the basic move-local lifecycle picture on a small synthetic setup.

Main limitations of v1:

  • passive failure is tied to split composition
  • routing is too hardcoded
  • retirement is too close to stock-cost-only
  • monitoring is too clean
  • the main summary path used a convenient normal approximation

v2 Summary

v2 is a scientifically stricter follow-up. It introduces:

  • a master synthetic pool plus explicit regime mixtures
  • a calibration-based deterministic router
  • conservative bounded lower bounds in the main move-certificate path
  • monitoring sweeps over exploration floors and drift severities
  • retirement decomposition into active routed utility and stock-cost effects
  • an optional conservative partial-support comparison

v2 is expected to produce weaker, mixed, or negative results in some settings. That is intentional.

v3 Summary

v3 is a narrower but cleaner follow-up to v2. It adds:

  • a true observational passive path where A1 is absent from deployment logs
  • a separate interventional replay path for +A1 and -A1
  • explicit schema-versus-semantic drift monitoring sweeps
  • frozen pre-drift planning versus post-drift replanning
  • a more natural partial-support setting from replay/tool availability

The current local v3 run is intentionally mixed:

  • the observational passive estimate is negative while the target-regime replay mean is positive
  • the main bounded admission certificate remains negative
  • the main bounded retirement certificate also remains negative even though the mean move gain is positive
  • semantic mild drift is materially harder to detect than schema drift
  • A2 becomes live after drift under replanning, but not in the frozen main pre-drift deployed subset

Those are not bugs to be hidden. They are the point of the stricter design.

v4 Summary

v4 is a different kind of follow-up. It does not try to extend the extraction benchmark. Instead, it isolates several mechanisms that are more specific to the paper:

  • supported forecast transport with unsupported-mass penalties
  • sign-correct managed-horizon lower envelopes
  • explicit subtraction of adoption overhead and bridge mismatch
  • channelwise lifecycle ledgers rather than naive heterogeneous score merging
  • invariance of signed accounting under simple measurable refinements
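The first mechanism above can be sketched as follows (a minimal illustration; the function names and the worst-case penalty are assumptions, not the repository's actual formulas). Supported forecast transport carries a forecast forward only on the replay-supported mass and charges the unsupported remainder at an explicit worst case, while naive extrapolation ignores support entirely:

```python
def naive_extrapolation(past_value: float) -> float:
    # Naive rule: carry the past surrogate value forward unchanged,
    # regardless of how much of the future regime is unsupported.
    return past_value

def supported_lower(past_value: float, unsupported_mass: float,
                    worst_case: float = -1.0) -> float:
    # Transport the forecast only on the replay-supported fraction;
    # charge the unsupported fraction at an explicit worst-case value.
    supported = 1.0 - unsupported_mass
    return supported * past_value + unsupported_mass * worst_case

# As unsupported mass rises, the naive estimate stays flat while the
# supported lower bound degrades toward the worst case.
print(naive_extrapolation(0.24))   # 0.24 at any unsupported mass
print(supported_lower(0.24, 0.0))  # 0.24
print(supported_lower(0.24, 1.0))  # -1.0
```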

The current local v4 run is intentionally mechanistic rather than statistical:

  • for the fragile profile, naive future extrapolation stays at 0.24 even when unsupported mass rises to 1.0, while the exact future value falls to 0.05 and the supported lower falls to -0.09
  • for negative core blocks, the sign-correct lower is materially more conservative than naive symmetric survival weighting
  • the move add_fragile is surrogate-positive but net-adoption-negative once overhead and bridge charges are applied
  • a naive merged-confidence rule accepts a bridge-invalid move with score 0.9245, while the channelwise ledger rejects it
  • in the controller rollout, the net-adoption policy chooses A2 and realizes higher utility than the surrogate-only or naive-merged policies
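The merged-versus-channelwise contrast in these results reduces to a veto structure (the scores, weights, and thresholds below are illustrative, not the repository's actual values):

```python
def naive_merged(scores, weights):
    # Merge heterogeneous channel scores into one confidence number.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def channelwise_gate(scores, thresholds):
    # Accept only if every channel clears its own threshold;
    # one invalid channel vetoes the move regardless of the others.
    return all(s >= t for s, t in zip(scores, thresholds))

# Hypothetical move: strong replay and monitoring, bridge-invalid channel.
scores = [0.98, 0.95, 0.10]                       # replay, monitoring, bridge
print(naive_merged(scores, [1, 1, 1]))            # high merged confidence, ≈ 0.68
print(channelwise_gate(scores, [0.5, 0.5, 0.5]))  # False: bridge channel vetoes
```

Averaging lets two strong channels mask one invalid one; per-channel thresholds cannot be bought off that way.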

These v4 results are not intended as a theorem-complete reproduction. They are controlled illustrations of why the paper separates forecast support, sign-correct horizon weighting, bridge accounting, and heterogeneous ledger channels.
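The sign-correct horizon weighting mentioned here can likewise be sketched with assumed interval survival bounds (not the paper's exact construction): when the true survival probability lies in an interval, a guaranteed lower bound must weight positive blocks by the low end and negative blocks by the high end, because a negative block may keep accruing for longer.

```python
def naive_envelope(blocks, survival):
    # Symmetric rule: one point-estimate survival weight for every block.
    return sum(survival * b for b in blocks)

def sign_correct_lower(blocks, s_lower, s_upper):
    # Guaranteed lower bound when survival lies in [s_lower, s_upper]:
    # positive blocks get the low survival, negative blocks the high one.
    return sum((s_lower if b >= 0 else s_upper) * b for b in blocks)

blocks = [0.5, -0.8]                         # a negative core block is present
print(naive_envelope(blocks, 0.7))           # ≈ -0.21
print(sign_correct_lower(blocks, 0.5, 0.9))  # ≈ -0.47, strictly more conservative
```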

v4 Bridge Summary

v4_bridge connects the empirical toy path from v3 to the mechanistic accounting path from v4.

It keeps:

  • v3 post-drift monitoring summaries
  • v3 exact paired replay on target post-drift regimes
  • v3 frozen versus replanned deployment setting
  • the same small stock and budget-feasible deployment space

It adds:

  • a supported forecast transport term derived from replay support and unsupported changed mass
  • an explicit audit-overhead term for keeping the fragile unit in stock or live deployment
  • a bridge-valid / weak / invalid status derived from monitoring strength, drift type, and unsupported mass
  • a naive merged-adoption baseline and a stricter channelwise ledger policy
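The net-adoption gate these additions implement can be sketched as simple signed accounting (the term names and magnitudes below are illustrative, not values from the runs): a move that looks positive on replay gain alone can go negative once the explicit charges are subtracted.

```python
def net_adoption(replay_gain: float, audit_overhead: float,
                 bridge_charge: float, unsupported_slack: float) -> float:
    # Surrogate evidence alone is replay_gain; net adoption also charges
    # the explicit costs of auditing, bridging, and unsupported forecast mass.
    return replay_gain - audit_overhead - bridge_charge - unsupported_slack

# Hypothetical numbers: the move is surrogate-positive on replay gain...
gain = net_adoption(replay_gain=0.30, audit_overhead=0.12,
                    bridge_charge=0.15, unsupported_slack=0.08)
print(gain)  # ≈ -0.05: ...but net-adoption-negative after the charges
```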

The current v4_bridge run is mixed on purpose:

  • in schema mild, the replay-only, naive merged, and ledger net-adoption policies all retire A1 and switch to A2
  • in schema severe, replay-only and naive merged retire A1, but the ledger policy keeps the pre-drift deployment because net-adoption remains negative
  • in semantic mild with budget 4, replay-only changes deployment very late, while naive merged and ledger both keep the current deployment
  • in semantic severe, replay-only retires A1 to B0, but naive merged and ledger do not move

This means the bridge layer does not uniformly improve outcomes. In some settings it agrees with replay-only replanning, and in others it blocks moves that the replay mean alone would accept.

Repository Layout

  • src/lifecycle_poc/: core implementation
  • src/lifecycle_poc/experiments/: v1 to v4 and v4_bridge experiment entrypoints
  • configs/default.yaml: v1 config
  • configs/v2.yaml: v2 config
  • configs/v3.yaml: v3 config
  • configs/v4.yaml: v4 config
  • tests/: lightweight unit tests
  • scripts/check_repo_hygiene.py: repository hygiene scan

Local Setup

Requirements

  • Python 3.11+
  • Ollama
  • local model gemma3:4b

Check that the model is available locally:

ollama list

If it is not listed, pull it first:

ollama pull gemma3:4b

Install

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -e .[dev]

On POSIX shells, activate the environment with source .venv/bin/activate instead of the PowerShell script.

How To Run

Run the full v1 PoC:

python -m lifecycle_poc.experiments.run_all

Run the full v2 PoC:

python -m lifecycle_poc.experiments.run_all_v2

Run the full v3 PoC:

python -m lifecycle_poc.experiments.run_all_v3

Run the full v4 PoC:

python -m lifecycle_poc.experiments.run_all_v4

Run the full v4 bridge experiment:

python -m lifecycle_poc.experiments.run_all_v4_bridge

Run v1 phase scripts:

python -m lifecycle_poc.experiments.run_passive_baseline
python -m lifecycle_poc.experiments.run_admission
python -m lifecycle_poc.experiments.run_monitoring
python -m lifecycle_poc.experiments.run_retirement
python -m lifecycle_poc.experiments.run_deployment

Run v2 phase scripts:

python -m lifecycle_poc.experiments.run_passive_regime_mismatch_v2
python -m lifecycle_poc.experiments.run_admission_v2
python -m lifecycle_poc.experiments.run_monitoring_sweep_v2
python -m lifecycle_poc.experiments.run_retirement_v2
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v2
python -m lifecycle_poc.experiments.run_partial_support_v2

Run v3 phase scripts:

python -m lifecycle_poc.experiments.run_passive_observational_v3
python -m lifecycle_poc.experiments.run_admission_v3
python -m lifecycle_poc.experiments.run_monitoring_sweep_v3
python -m lifecycle_poc.experiments.run_retirement_v3
python -m lifecycle_poc.experiments.run_deployment_budget_sweep_v3
python -m lifecycle_poc.experiments.run_deployment_frozen_vs_replanned_v3
python -m lifecycle_poc.experiments.run_partial_support_v3

Run v4 phase scripts:

python -m lifecycle_poc.experiments.run_forecast_transport_v4
python -m lifecycle_poc.experiments.run_sign_correct_envelope_v4
python -m lifecycle_poc.experiments.run_adoption_v4
python -m lifecycle_poc.experiments.run_ledger_controller_v4
python -m lifecycle_poc.experiments.run_accounting_refinement_v4

Run v4 bridge phase scripts:

python -m lifecycle_poc.experiments.run_bridge_replanning_v4
python -m lifecycle_poc.experiments.run_bridge_drift_compare_v4
python -m lifecycle_poc.experiments.run_bridge_budget_compare_v4

Run tests:

python -m pytest

Run the hygiene scan:

python scripts/check_repo_hygiene.py

Outputs

v1 writes outputs under outputs/.

v2 writes separate outputs under outputs/v2/, including:

  • outputs/v2/reports/run_all_v2_summary.json
  • outputs/v2/reports/passive_regime_mismatch_v2.json
  • outputs/v2/reports/admission_v2_summary.json
  • outputs/v2/reports/monitoring_sweep_v2.csv
  • outputs/v2/reports/retirement_v2_summary.json
  • outputs/v2/reports/deployment_budget_sweep_v2.csv
  • outputs/v2/reports/partial_support_v2.json

Generated v2 figures include:

  • outputs/v2/figures/passive_regime_mismatch_v2.png
  • outputs/v2/figures/admission_v2_curve.png
  • outputs/v2/figures/monitoring_tradeoff_v2.png
  • outputs/v2/figures/retirement_v2_decomposition.png
  • outputs/v2/figures/deployment_budget_sweep_v2.png
  • outputs/v2/figures/partial_support_v2.png

v3 writes separate outputs under outputs/v3/, including:

  • outputs/v3/reports/run_all_v3_summary.json
  • outputs/v3/reports/passive_observational_v3_summary.json
  • outputs/v3/reports/admission_v3_summary.json
  • outputs/v3/reports/monitoring_sweep_v3.csv
  • outputs/v3/reports/retirement_v3_summary.json
  • outputs/v3/reports/deployment_budget_sweep_v3.csv
  • outputs/v3/reports/deployment_frozen_vs_replanned_v3.json
  • outputs/v3/reports/partial_support_v3.json

Generated v3 figures include:

  • outputs/v3/figures/passive_observational_v3.png
  • outputs/v3/figures/admission_v3_curve.png
  • outputs/v3/figures/monitoring_tradeoff_v3.png
  • outputs/v3/figures/retirement_v3_decomposition.png
  • outputs/v3/figures/retirement_v3_decomposition_mild.png
  • outputs/v3/figures/retirement_v3_decomposition_severe.png
  • outputs/v3/figures/deployment_budget_sweep_v3.png
  • outputs/v3/figures/deployment_frozen_vs_replanned_v3.png
  • outputs/v3/figures/deployment_frozen_vs_replanned_v3_mild.png
  • outputs/v3/figures/deployment_frozen_vs_replanned_v3_severe.png
  • outputs/v3/figures/partial_support_v3.png

v4 writes separate outputs under outputs/v4/, including:

  • outputs/v4/reports/run_all_v4_summary.json
  • outputs/v4/reports/forecast_transport_v4.json
  • outputs/v4/reports/sign_correct_envelope_v4.json
  • outputs/v4/reports/adoption_v4.json
  • outputs/v4/reports/ledger_gate_v4.json
  • outputs/v4/reports/controller_rollout_v4.json
  • outputs/v4/reports/accounting_refinement_v4.json

Generated v4 figures include:

  • outputs/v4/figures/forecast_transport_v4.png
  • outputs/v4/figures/sign_correct_envelope_v4.png
  • outputs/v4/figures/adoption_v4.png
  • outputs/v4/figures/ledger_gate_v4.png
  • outputs/v4/figures/controller_rollout_v4.png
  • outputs/v4/figures/accounting_refinement_v4.png

v4_bridge writes separate outputs under outputs/v4_bridge/, including:

  • outputs/v4_bridge/reports/run_all_v4_bridge_summary.json
  • outputs/v4_bridge/reports/policy_comparison_v4_bridge.csv
  • outputs/v4_bridge/reports/move_decomposition_v4_bridge.csv
  • outputs/v4_bridge/reports/bridge_difference_examples_v4.csv
  • outputs/v4_bridge/reports/drift_compare_v4_bridge.csv
  • outputs/v4_bridge/reports/budget_compare_v4_bridge.csv

Generated v4_bridge figures include:

  • outputs/v4_bridge/figures/policy_comparison_v4_bridge.png
  • outputs/v4_bridge/figures/budget_compare_v4_bridge.png
  • outputs/v4_bridge/figures/rollout_schema_mild_v4_bridge.png
  • outputs/v4_bridge/figures/rollout_schema_severe_v4_bridge.png
  • outputs/v4_bridge/figures/rollout_semantic_mild_v4_bridge.png
  • outputs/v4_bridge/figures/rollout_semantic_severe_v4_bridge.png

Reproducibility Notes

  • dataset generation is seed-fixed
  • prompts are fixed
  • tool behavior is deterministic
  • routing and planning are deterministic given calibration data
  • outputs are cached by sample and unit
  • the main experiment path is fully local

Bitwise reproducibility can still depend on local model-runtime behavior for v1 to v3 and for the replay-backed part of v4_bridge. v4 does not have that dependency because it is deterministic and model-free. To keep the PoCs auditable, the repository stores enough structured logs under outputs/, outputs/v2/, outputs/v3/, outputs/v4/, and outputs/v4_bridge/ to recompute reported statistics from saved artifacts.
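A minimal recompute-style audit hook might look like this (the artifact path is one of the report files listed above; the loader itself is illustrative and not part of the repository):

```python
import json
from pathlib import Path

def load_summary(path="outputs/v2/reports/run_all_v2_summary.json"):
    # Audit hook: return the saved summary artifact as a dict, or None
    # if the artifact has not yet been generated by a local run.
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())

summary = load_summary()
if summary is None:
    print("no artifact yet; run python -m lifecycle_poc.experiments.run_all_v2")
else:
    for key in sorted(summary):
        print(key, "->", summary[key])
```

Reported statistics can then be recomputed from the loaded artifacts rather than trusted from the figures.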

Scientific Notes

The main scientific discipline in this repository is explicit scope control.

  • exact paired replay is the core evidential mechanism
  • drift is controlled and synthetic
  • the monitoring stack is a simplified adaptive sentinel-style monitor, not a theorem-complete implementation
  • v1 keeps a normal approximation in the main summary path
  • v2 moves the bounded conservative lower bound into the main report path and keeps the normal approximation only as supplemental output
  • v3 separates observational passive evidence from replay/interventional evidence and keeps bounded certificates as the main admission and retirement path
  • v4 focuses on structural mechanisms from the paper rather than end-to-end extraction performance, and it does so with deterministic toy constructions rather than learned estimators
  • v4_bridge explicitly separates:
    • replay evidence: exact paired replay mean gain and bounded lifecycle lower on the target post-drift regime
    • monitoring evidence: detection rate, delay, replan cycle, and monitoring-strength summary from the v3 sentinel sweep
    • mechanistic assumptions: forecast transport slack, bridge validity / charge, and audit overhead

This means v2 can legitimately produce:

  • positive replay means with negative bounded lower bounds
  • mixed evidence across regimes
  • stronger retirement results than admission results

This means v3 can legitimately produce:

  • negative observational passive estimates with positive replay means
  • positive replay means with negative bounded lower bounds
  • harder semantic-drift detection than schema-drift detection
  • replanning gains that appear only after drift evidence is incorporated

This means v4 can legitimately produce:

  • positive surrogate lifecycle terms with negative net-adoption terms
  • conservative supported-forecast lowers that are much weaker than naive future extrapolation
  • channelwise rejection of moves that score well under naive merged confidence
  • mechanism-level illustrations that are informative without being statistical performance studies

This means v4_bridge can legitimately produce:

  • replay-only replans that are accepted on mean replay gain while their bounded lifecycle lower remains negative
  • ledger rejections that are driven partly by mechanistic bridge or overhead assumptions rather than replay alone
  • settings where all policies agree, alongside settings where only replay-only or naive merged policies move

Those outcomes should not be interpreted as failures of the repository. They are part of the stricter test design.

Limitations

  • synthetic data only
  • one model family only
  • single-step extraction only
  • no external mutable tool ecosystem
  • no theorem-complete implementation of the full paper
  • no claim of general causal identification beyond the controlled PoCs
  • no claim of real-world operational safety or deployment readiness
  • the current v3 certificate sweeps use local-scale sample sizes {8, 16, 24} rather than very large replay batches
  • the current v3 replanning follow-up covers schema drift, not semantic replanning
  • A2 is live after post-drift replanning, but not in the frozen pre-drift main subset
  • v4 is hand-specified and deterministic, so it does not by itself provide statistical evidence about model behavior in the wild
  • v4 is not an end-to-end controller benchmark and should be read as a mechanism-isolation supplement to v1 to v3
  • v4_bridge still uses a small synthetic stock and a single deterministic bridge formula family
  • in v4_bridge, the replay-only policy selects moves by replay mean gain, while the bounded lifecycle lower is still reported separately and can remain negative
  • v4_bridge does not show a universal advantage for the ledger policy; in the current run it agrees with replay-only under schema mild, but blocks moves under harsher settings

Security And Privacy

  • no secrets are required in the main experiment path
  • only synthetic data are used
  • model weights are not committed
  • outputs/ is ignored by default
  • scripts/check_repo_hygiene.py scans for common path and secret leakage patterns

See SECURITY.md for more detail.

Additional Result Summaries

  • result_summary.md: v1 summary
  • result_summary_v2.md: v2 summary
  • result_summary_v3.md: v3 summary
  • result_summary_v4.md: v4 summary
  • result_summary_v4_bridge.md: v4 bridge summary

Citation

If you use or extend this repository, cite the paper:

@misc{takahashi2026counterfactual,
  author = {Takahashi, K.},
  title = {Counterfactually Auditable Lifecycle Certification for Autonomous Agents},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.19089134},
  url = {https://doi.org/10.5281/zenodo.19089134}
}

About

Public, fully local PoCs for counterfactually auditable lifecycle certification: exact paired replay, drift monitoring, post-drift replanning, and bridge-aware ledger control on synthetic tasks.
