Deterministic operational replay validation for long-horizon AI agents.
Comptextv7 tests whether compact, replay-safe operational state can preserve workflow continuity across compression, reconstruction, and CI-audited replay checks — without LLM judges, embeddings, vector databases, or external APIs.
Live showcase · Demo walkthrough · Benchmark explanation · Replay report
Long-running agents fail when replayed context becomes operationally untrustworthy:
- constraints disappear;
- blockers detach from tasks;
- tool sequences mutate;
- dependencies collapse;
- summaries sound fluent but lose actionable state.
Comptextv7 focuses on preserving the state needed to continue work, not preserving raw chat history. The project treats replay as an auditable operational-state problem: extract the fields that matter, compact them, reconstruct them, and verify them with deterministic checks.
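The extract, compact, reconstruct, verify loop can be sketched in a few lines. This is a minimal illustration, not Comptextv7's actual API; the field names and helper functions are assumptions chosen to mirror the failure modes listed above.

```python
# Minimal sketch of extract -> compact -> verify for operational state.
# Field names and helpers are illustrative, not Comptextv7's real interface.

REQUIRED_FIELDS = ("tasks", "constraints", "blockers", "tool_sequence")

def extract_state(raw_context: dict) -> dict:
    """Keep only the operational fields needed to continue work."""
    return {k: raw_context[k] for k in REQUIRED_FIELDS if k in raw_context}

def compact(state: dict) -> dict:
    """Drop empty fields; a real compactor would also dedupe and normalize."""
    return {k: v for k, v in state.items() if v}

def validate_replay(original: dict, reconstructed: dict) -> bool:
    """Deterministic check: every non-empty required field survives exactly."""
    expected = compact(extract_state(original))
    return all(reconstructed.get(k) == v for k, v in expected.items())

raw = {"tasks": ["ship v7"], "constraints": ["no external APIs"],
       "blockers": [], "tool_sequence": ["lint", "test"], "chit_chat": "..."}
state = compact(extract_state(raw))
assert validate_replay(raw, state)  # lossless under this trivial compactor
```

The point of the sketch is the shape of the check: validation compares concrete field values, so it needs no LLM judge and always returns the same answer for the same inputs.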
| Evidence | Current result |
|---|---|
| Paper replay fixtures | 3 dense technical papers |
| Agent trace fixtures | 3 multi-step workflows |
| Paper avg compression | 1.347063 |
| Agent avg compression | 1.773954 |
| Paper replay consistency | 0.791667 |
| Agent replay consistency | 1.000000 |
| Agent operational drift | 0.000000 |
| Evaluation mode | deterministic, no LLM judging |
| Artifact format | committed JSON + CI upload |
Sources: artifacts/paper_replay_results.json and artifacts/agent_trace_replay_results.json.
- Paper replay is lossy under dense technical prose. The current paper fixtures include entities, limitations, sections, and metrics that are harder to preserve after compaction.
- Agent trace replay is currently near-lossless because traces are structured. The checked-in traces expose explicit tasks, blockers, dependencies, tool order, and recovery actions.
- 1.000000 replay consistency does not mean solved memory. It means exact preservation under the current structured trace fixtures and the current deterministic validator.
- Operational drift is field loss, not subjective quality. A non-zero drift rate would mean replay lost required operational fields.
- Next target is iterative replay degradation. The next milestone is to repeatedly compact and replay state until drift curves and collapse points are visible.
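Because drift is defined as field loss, it reduces to a simple ratio. The sketch below shows one plausible formulation; the field names are illustrative, not the project's actual schema.

```python
# Sketch of operational drift as a field-loss rate.
# 0.0 means every required field survived replay; field names are made up.

def drift_rate(required: set, replayed: dict) -> float:
    """Fraction of required operational fields missing or empty after replay."""
    lost = [f for f in required if not replayed.get(f)]
    return len(lost) / len(required)

required = {"active_tasks", "blockers", "dependencies", "tool_order"}
replayed = {"active_tasks": ["t1"], "blockers": ["b1"],
            "dependencies": ["t1->t2"], "tool_order": ["fetch", "parse"]}
print(drift_rate(required, replayed))  # 0.0: nothing lost
```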
- Not chat-history storage.
- Not vector memory.
- Not model-judged summarization.
- Not autonomous agent orchestration.
- Deterministic operational-state replay validation.
```mermaid
flowchart LR
  A[Raw Context / Agent Trace] --> B[Operational State Extraction]
  B --> C[Compact Replay State]
  C --> D[Replay Reconstruction]
  D --> E[Deterministic Validation]
  E --> F[CI Artifact]
```
Comptextv7 turns noisy context into compact operational state, then validates whether replay reconstructs the fields needed to continue work.
- Validates: whether dense technical paper summaries preserve entities, metrics, limitations, and section structure after deterministic replay compression.
- Artifact: artifacts/paper_replay_results.json
- Method: docs/benchmarks/paper_replay.md
- Current avg compression: 1.347063
- Current replay consistency: 0.791667
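A ratio like 1.347063 can be read as original size over compacted size, averaged across fixtures. The sketch below assumes a character-based size measure and invented fixture sizes; whether the real benchmark counts characters or tokens is not stated here.

```python
# How an average compression ratio could be computed: original size divided
# by compacted size, averaged over fixtures. All sizes below are invented.

def compression_ratio(original_chars: int, compact_chars: int) -> float:
    return original_chars / compact_chars

fixtures = [(5400, 4000), (6100, 4550), (4980, 3700)]  # (original, compact)
ratios = [compression_ratio(o, c) for o, c in fixtures]
avg = sum(ratios) / len(ratios)
print(round(avg, 6))
```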
- Validates: whether multi-step agent workflows preserve active tasks, constraints, dependencies, tool sequences, unresolved blockers, deployment requirements, and recovery actions.
- Artifact: artifacts/agent_trace_replay_results.json
- Method: docs/benchmarks/agent_trace_replay.md
- Current avg compression: 1.773954
- Current replay consistency: 1.000000
- Operational drift: 0.000000
- Interpretation: the current setup is near-lossless because the fixtures are structured; this is a useful baseline, not a universal memory claim.
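A replay-consistency score over structured trace fields can be sketched as an exact, order-sensitive field comparison. The field names mirror the list above but are assumptions, not the benchmark's real schema.

```python
# Sketch of a deterministic replay-consistency score over structured trace
# fields. Field names are illustrative; comparison is exact and order-sensitive.

TRACE_FIELDS = ("active_tasks", "constraints", "dependencies",
                "tool_sequence", "blockers", "recovery_actions")

def replay_consistency(original: dict, replayed: dict) -> float:
    """Fraction of trace fields preserved exactly after replay."""
    kept = sum(1 for f in TRACE_FIELDS if replayed.get(f) == original.get(f))
    return kept / len(TRACE_FIELDS)

trace = {f: [f + "_1"] for f in TRACE_FIELDS}
assert replay_consistency(trace, dict(trace)) == 1.0   # exact preservation
mutated = dict(trace, tool_sequence=["reordered"])
assert replay_consistency(trace, mutated) < 1.0        # mutation detected
```

An order-sensitive comparison is what makes tool-sequence mutation (one of the failure modes listed earlier) detectable at all.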
This suite is a separate long-horizon stress surface under reports/replay_continuity/.
It remains useful context, but the README's focused narrative is the deterministic operational replay benchmark family above.
| System | Iteration 25 | Iteration 50 | Iteration 100 | Iteration 250 |
|---|---|---|---|---|
| Naive | 0.039 | 0.039 | 0.043 | 0.039 |
| Baseline | 0.294 | 0.294 | 0.294 | 0.294 |
| Adaptive | 0.679 | 0.476 | 0.302 | 0.302 |
| Comptextv7 | 1.000 | 0.995 | 0.824 | 0.572 |
The committed 250-iteration report records Comptextv7 mean final continuity at 0.571783, rounded to 0.572 here.
Detail fidelity still degrades: hidden truth survival is 0.570173, and evaluator agreement divergence is 0.421743.
| System | Approx collapse point |
|---|---|
| Naive | ~1 iteration |
| Baseline | ~10 iterations |
| Adaptive | ~45 iterations |
| Comptextv7 | censored at ~250 iterations in this suite |
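The approximate collapse points above can be read off a continuity curve as the first iteration below some threshold; "censored" means the threshold was never crossed within the run. The threshold value and curve shape below are illustrative, not the suite's actual parameters.

```python
# Sketch of collapse-point estimation from a continuity curve: the first
# iteration where continuity drops below a threshold. The 0.1 threshold and
# the example curves are assumptions, not the suite's real configuration.

def collapse_point(curve, threshold=0.1):
    """curve: list of (iteration, continuity) pairs, ascending by iteration."""
    for iteration, continuity in curve:
        if continuity < threshold:
            return iteration
    return None  # censored: no collapse observed within this run

naive = [(1, 0.05), (25, 0.039)]
comptext = [(25, 1.0), (50, 0.995), (100, 0.824), (250, 0.572)]
print(collapse_point(naive))     # 1
print(collapse_point(comptext))  # None -> censored at run length
```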
- replay_degradation_curves.svg
- continuity_half_life_chart.svg
- semantic_drift_graph.svg
- replay_collapse_curves.svg
- evaluator_agreement_divergence.svg
- hidden_constraint_survival_curves.svg
- no LLM judging;
- no embeddings;
- no external APIs;
- deterministic JSON artifacts;
- CI reproducible;
- audit-friendly.
- Fixtures are curated and checked in.
- Structured agent traces currently replay near-losslessly.
- This is not solved AI memory.
- This is not production telemetry.
- This is not an autonomous agent framework.
- Evaluator divergence remains material in the long-horizon stress suite.
- A stronger iterative degradation benchmark is the next technical milestone.
Next: iterative replay degradation. Repeatedly compact and replay operational state to expose drift curves, collapse points, and field-level failure modes under pressure.
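The shape of that milestone can be sketched as a loop that compacts the same state repeatedly and records how much survives per iteration. The lossy_compact helper below is a toy stand-in for a real compactor, included only to make the loop runnable.

```python
# Sketch of an iterative degradation loop: repeatedly compact the same state
# and record survival per iteration. lossy_compact is a toy stand-in.

def lossy_compact(state: dict, keep_ratio: float = 0.9) -> dict:
    """Toy compactor: keeps the leading keep_ratio of each list field."""
    return {k: v[: max(1, int(len(v) * keep_ratio))] for k, v in state.items()}

def degradation_curve(state: dict, iterations: int) -> list:
    baseline = sum(len(v) for v in state.values())
    curve = []
    for _ in range(iterations):
        state = lossy_compact(state)
        survived = sum(len(v) for v in state.values())
        curve.append(survived / baseline)
    return curve

state = {"tasks": list(range(10)), "constraints": list(range(10))}
curve = degradation_curve(state, 5)
assert curve[0] >= curve[-1]  # continuity is non-increasing here
```

Plotting such a curve per field, rather than in aggregate, is what would expose field-level failure modes alongside the collapse point.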
| Surface | Link |
|---|---|
| Live showcase | comptextv7.vercel.app |
| Demo walkthrough | docs/DEMO_WALKTHROUGH.md |
| Showcase readiness | docs/SHOWCASE_READINESS.md |
| Benchmark explanation | docs/BENCHMARK_EXPLANATION.md |
| Replay report | reports/replay_continuity/validation_report.md |
| API surface | docs/API_SURFACE.md |
```text
Comptextv7/
├── artifacts/                  # committed deterministic replay benchmark JSON
├── benchmarks/                 # deterministic compression, replay, and audit runners
├── contracts/                  # machine-readable validation and handoff contracts
├── dashboard/                  # backend plus React operations console
├── docs/                       # benchmark, showcase, and reviewer documentation
├── reports/replay_continuity/  # adversarial continuity metrics and SVG charts
├── scripts/                    # validation, reporting, and artifact tooling
├── src/                        # KVTC engine, audit, and semantic validation modules
├── tests/                      # Python regression and replay validation tests
└── README.md
```
Do not commit:
- proprietary customer data;
- secrets, API keys, tokens, cookies, or credentials;
- raw production logs;
- unsanitized replay fixtures;
- private deployment credentials or environment dumps.
Comptextv7 is a deterministic, synthetic-only research prototype for operational replay persistence and reviewable diagnostic infrastructure.
Comptextv7 favors artifact-backed review over trust in results produced on any single local machine.
| Workflow | Role |
|---|---|
| ci.yml | Runs deterministic replay, tests, telemetry, and validation gates. |
| agent-checks.yml | Runs repository/report/contract checks plus dashboard validation. |
| validation_runner.yml | Publishes compact cloud validation result artifacts. |
Install the test dependency set:

```bash
python -m pip install -e '.[test]'
```

Regenerate deterministic replay artifacts:

```bash
python tests/utils/paper_replay_runner.py
python tests/utils/agent_trace_replay_runner.py
python benchmarks/run_replay_continuity.py --iterations 250 --output-dir reports/replay_continuity
```

Run focused checks:

```bash
pytest tests/test_paper_replay_bench.py tests/test_agent_trace_replay.py tests/test_replay_continuity.py
```

Run the broader local gate:

```bash
python -m pytest
python scripts/validate.py replay
python scripts/validate.py token
python scripts/validate.py forensic
python scripts/validate_contracts.py
python scripts/validate_api_exports.py
```