Execution assumption auditing: systematic sensitivity testing #6

@poofeth

Description

Problem

Experiments embed execution assumptions (fill probability, cost models, position sizing, slippage) that are rarely challenged by the framework. These assumptions are often the largest source of gap between backtest and live performance, yet the lab treats them as fixed constants.

In one real deployment, a single uncalibrated assumption (a fill probability set to 60% when a realistic value was closer to 40%) likely overstated the production champion's Sharpe by 20-30%. This was caught in a manual audit; the framework itself would never have surfaced it.

Proposal

Add an execution_audit diagnostic that runs automatically on a schedule.

Configuration in branches.yaml

execution_audit:
  trigger: "every_10_cycles"
  also_on: "human_checkpoint"
  checks:
    - name: "fill_probability_sensitivity"
      param_path: "execution.fill_probability"
      sweep_values: [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
      apply_to: "production_champion"
    - name: "cost_model_sensitivity"
      param_path: "execution.cost_per_trade"
      sweep_multipliers: [0.5, 1.0, 1.5, 2.0, 3.0]
      apply_to: "production_champion"
    - name: "position_size_impact"
      param_path: "execution.position_size"
      sweep_values: [50, 100, 500, 1000, 5000]
      apply_to: "production_champion"
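A check like the ones above could be executed as a simple sweep loop. This is a hypothetical sketch: `run_backtest` and the config shape are assumed hooks into the lab's harness, not existing APIs.

```python
import copy

def set_param(config, dotted_path, value):
    """Set a nested dict value by dotted path, e.g. 'execution.fill_probability'."""
    keys = dotted_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

def run_sweep(base_config, check, run_backtest):
    """Re-run the champion backtest once per swept value and collect metrics."""
    values = check.get("sweep_values")
    if values is None:
        # sweep_multipliers scale the currently assumed value of the parameter
        base = base_config
        for key in check["param_path"].split("."):
            base = base[key]
        values = [m * base for m in check["sweep_multipliers"]]
    results = []
    for value in values:
        cfg = copy.deepcopy(base_config)  # never mutate the champion's config
        set_param(cfg, check["param_path"], value)
        results.append({"value": value, "metrics": run_backtest(cfg)})
    return results
```

Deep-copying the config per run keeps the sweep side-effect free, so the audit cannot contaminate the champion's actual configuration.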

Output

A sensitivity table showing how the champion's primary metric degrades:

Fill Probability Sensitivity (champion: cap_v3)
  fill_prob  |  Sharpe  |  Return%  |  PnL
  0.30       |   0.82   |   1.4%    |  $1,400
  0.40       |   1.21   |   3.1%    |  $3,100
  0.50       |   1.67   |   4.8%    |  $4,800
  0.60       |   2.35   |   6.4%    |  $6,400  ← current assumption
  0.70       |   2.89   |   7.9%    |  $7,900

Zero-crossing: Sharpe crosses 0 at fill_prob ≈ 0.15
Robustness: Champion is profitable at all tested fill rates above 0.15
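The zero-crossing line in the report can be derived from the sweep itself by linearly interpolating between the two bracketing points. A minimal sketch (the function name is illustrative):

```python
def zero_crossing(values, metrics):
    """Return the interpolated parameter value where the metric crosses 0,
    or None if it never changes sign across the sweep."""
    for i in range(len(values) - 1):
        v0, v1 = values[i], values[i + 1]
        m0, m1 = metrics[i], metrics[i + 1]
        if m0 == 0:
            return v0  # exact crossing at a swept point
        if m0 * m1 < 0:
            # linear interpolation between the two bracketing points
            return v0 + (v1 - v0) * (0 - m0) / (m1 - m0)
    return None
```

Since the crossing often falls below the lowest swept value (as in the 0.15 example above), a real implementation would also want to extrapolate below the sweep range or extend the sweep downward until the sign flips.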

Flagging

If the champion's metric crosses zero at a REALISTIC assumption value (defined per domain), flag in the handoff:

"WARNING: Production champion's Sharpe crosses zero at fill_probability=0.15. If actual fills are below 15%, the strategy loses money. Current assumption (60%) has no empirical calibration."
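The flagging rule could be a small predicate over the zero-crossing: warn only when the crossing lies inside the realistic range for that domain. `realistic_floor` and the function name are assumptions for illustration.

```python
def flag_fragility(param_name, crossing, realistic_floor, assumed_value):
    """Return a handoff warning string if the champion's metric can cross zero
    within the realistic range of the assumption, else None."""
    if crossing is None or crossing < realistic_floor:
        return None  # crossing is absent or below any plausible value
    return (
        f"WARNING: Production champion's metric crosses zero at "
        f"{param_name}={crossing:.2f}, inside the realistic range "
        f"(floor {realistic_floor:.2f}). Current assumption "
        f"({assumed_value:.2f}) has no empirical calibration."
    )
```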

Integration with judge

Optionally adjust the composite score by a "robustness discount" based on sensitivity:

robustness_discount = (metric_at_pessimistic / metric_at_assumed)
adjusted_score = composite_score * robustness_discount
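A direct transcription of the discount formula above, with the discount clamped to [0, 1] so a metric that improves under pessimistic assumptions cannot inflate the score. The pessimistic/assumed metric values would come from the sensitivity sweep.

```python
def adjusted_score(composite_score, metric_at_pessimistic, metric_at_assumed):
    """Scale the judge's composite score by how much of the metric survives
    under the pessimistic assumption (discount clamped to [0, 1])."""
    if metric_at_assumed <= 0:
        return 0.0  # champion is already unprofitable under its own assumption
    discount = max(0.0, min(1.0, metric_at_pessimistic / metric_at_assumed))
    return composite_score * discount
```

For the table above, a pessimistic fill of 0.40 against the assumed 0.60 gives a discount of 1.21 / 2.35 ≈ 0.51, roughly halving the champion's composite score.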

Why this matters

The gap between backtest and live performance is almost always driven by execution assumptions, not signal quality. A framework that treats these assumptions as immutable constants produces overconfident performance estimates. Systematic auditing makes the numbers honest.

Relationship to existing features

  • Natural fit as a diagnostic experiment type
  • Runs on its own cadence (every 10 cycles) and additionally at human checkpoints (every 15 cycles) via also_on
  • Output feeds into the handoff and checkpoint reports
  • Could trigger a frame_challenge if sensitivity reveals the champion is fragile
