- Two end-to-end tracks: a minimal
CartPole-v1track (DQN + LTL monitoring + SMT) and a Safety-Gymnasium track (SafetyPointGoal1-v0, PPO + cost-based safety + local verification/certification). - Runtime monitoring on finite traces: express safety/goal properties as small temporal-logic templates (e.g.,
G(·),F(·)) plus episode-level budgets, and report satisfaction rates and violation diagnostics. - Data-driven risk artifacts: estimate bounded-horizon “unsafe within H steps” risk from rollouts (1D hazard-lidar bin risk and a lightweight learned risk model) for downstream selection and comparison.
- Counterexample-guided improvement loop: use high-risk regions as “counterexamples” to shape retraining via extra penalties and re-evaluate under the same monitoring protocol.
- Local formal checks with explicit assumptions: (i) Z3-based local SMT search for counterexamples under a local linear approximation, and (ii) α,β-CROWN one-step local certification over a PPO policy with a proxy unsafe predicate.
safe-rl/
├── README.md
├── requirements.txt
├── cartpole/
│ ├── stage0_simulation.py # train / rollout controller
│ ├── stage0_plot_training.py
│ ├── stage0_plot_trajectories.py
│ ├── stage1_monitoring.py # trace-based spec monitoring
│ ├── stage2_sampling_verify.py # sampling-based safety map
│ ├── stage2_compare_robustness.py # robustness comparisons
│ ├── stage3_smt_verify.py # local SMT checks
│ ├── stage3_smt_ce_guided.py # CE-guided SMT checks
├── safety_gym/
│ ├── stage0_train.py # train PPO baseline
│ ├── stage0_rollout_and_risk.py # rollouts + risk artifacts
│ ├── stage0_plot_train.py
│ ├── stage1_monitoring.py # runtime monitoring over episode traces
│ ├── stage1_analyze.py
│ ├── stage1_ce_retrain.py # counterexample-guided retraining
│ ├── stage2_abstraction_mc_hazard_lidar.py # 1D abstraction + reachability queries
│ ├── stage3_local_smt.py # local SMT search (risk-feature space)
│ ├── stage3_ppo_crown_one_step.py # α,β-CROWN one-step local certification
│ ├── plot_stage3.py
└── outputs/ # generated outputs (plots, logs, models)
Install Python dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtThis track uses:
gymnasiumforCartPole-v1torchfor a small DQN controller (ReLU MLP)z3-solverfor SMT-based Stage 3 experiments
All outputs for this track are written under outputs/cartpole_stage0/.
- Goal: learn a working controller and produce representative rollouts for analysis.
- What happens: a policy is trained in the environment; then trajectories are collected to serve as the “dataset” for later stages.
- Why it matters: later stages treat the learned controller and its rollouts as the object of safety analysis.
- Goal: evaluate a clear safety specification on recorded trajectories.
- What happens: define a safety property (e.g., “always stay within position/angle bounds”) and compute how often it is satisfied, plus when/where violations occur.
- Why it matters: turns “it seems safe” into measurable spec satisfaction metrics.
This track uses a deliberately small LTL fragment to express safety properties and monitor them on finite RL rollouts.
Specification (LTL fragment)
- Formula shape: only top-level
G ψ(“always”) orF ψ(“eventually”) is supported, whereψcontains only boolean structure over atomic propositions. - Atomic proposition: a Python predicate
p(state) -> boolevaluated on each observed state. - In CartPole, the state is a 4D vector
[x, x_dot, theta, theta_dot]. A standard safety atom is:|x| <= x_max ∧ |theta| <= theta_max. - Implementation references:
cartpole/common/specs.py:1.
Monitor (finite-trace semantics)
- Given a finite trajectory
s_0, …, s_{T-1}:G ψis satisfied iffψ(s_t)holds for everyt.F ψis satisfied iffψ(s_t)holds for at least onet.
- Implementation idea (streaming / single pass):
- Evaluate
ψ(s_t)at each step (eval_stateevaluates the non-temporal part). - Maintain a running boolean:
G ψ: starttrue; flip tofalseon the first violation.F ψ: startfalse; flip totrueon the first satisfaction.
- Evaluate
- Implementation references:
cartpole/common/monitors.py:1.
Motivation
- Environment termination is not the same thing as a safety specification: it is a task-specific stopping rule and often too permissive (or simply different) from the safety envelope you care about.
- Stage 1 makes the safety requirement explicit and measurable by checking a specification on trajectories produced by the learned controller.
- It also lets you tighten/loosen the safety margin (e.g., choose stricter bounds than the environment’s default thresholds) and observe how the satisfaction rate changes.
What it computes
- Inputs: trajectories collected in Stage 0 (a list of rollouts; each rollout stores a time series of states plus actions/rewards/dones).
- Specification: build the safety predicate with chosen
(x_max, theta_max)and formφ_safe = G(p_safe). - For each trajectory:
- Run the LTL monitor over the full state sequence to get a final verdict (satisfied/violated).
- Record the first time step that violates the safety predicate (a simple “time-to-violation” statistic).
- Aggregate over trajectories:
- safe ratio (#satisfying / #total), number of unsafe episodes, and simple stats of the first-violation step.
Extra sanity-check view
- Stage 1 also reports a “no-monitor sweep” under the same thresholds: episode-level and step-level violation rates. This gives a second perspective on how strict the chosen safety bounds are (and helps catch edge cases where the spec is too tight/too loose).
- Goal: map out which initial conditions tend to violate the safety spec.
- What happens: systematically sample a grid of initial states, simulate the closed loop for a fixed horizon, and label each initial condition as safe/unsafe under the chosen spec.
- Why it matters: produces an interpretable “unsafe region” approximation and a set of representative counterexamples.
Motivation
- Stage 1 tells you “how often the spec is violated on the trajectories you happened to collect”.
- Stage 2 tries to answer a different question: “from which states does the closed loop tend to become unsafe?”, i.e., give a geometric picture in state space.
- This is useful for:
- identifying vulnerable regions (where safety is brittle),
- extracting counterexamples to guide later formal checks (Stage 3),
- comparing controllers beyond reward (e.g., NN vs a hand-designed baseline).
Problem being approximated
- Ideal (hard) goal: characterize the set of initial states from which the closed-loop system will ever violate the safety spec.
- Stage 2 uses a pragmatic approximation:
- restrict to a low-dimensional slice of the state space (e.g., vary
(x, theta)and fix velocities), - check violations over a finite horizon (bounded-time rather than “forever”),
- label states by simulation outcomes (sampling-based rather than proof-based).
- restrict to a low-dimensional slice of the state space (e.g., vary
What gets labeled as “unsafe”
- Each sampled initial condition is rolled out under the learned controller for a fixed number of steps (or until termination).
- If the safety predicate is violated at any time during this rollout window, the initial condition is labeled unsafe; otherwise safe (for the chosen horizon and thresholds).
- This creates an “unsafe mask” over the sampled grid, which can be visualized as a heatmap/contour.
Counterexamples (why we store them)
- A counterexample here is simply a sampled initial condition that triggers a violation (plus basic context such as the first violation time).
- These counterexamples act as “seeds” for Stage 3: rather than guessing where to formally verify, you verify local regions around empirically bad states.
Robustness / controller comparison (optional extension)
- Beyond a binary safe/unsafe map, Stage 2 can summarize how close states are to the safety boundary (a normalized margin) and plot “margin vs violation probability” curves.
- The same procedure can be run for multiple controllers (e.g., the learned NN controller vs a heuristic baseline) to compare safety robustness across the state space, not just on a single set of rollouts.
- Goal: on a local region of the state space, decide whether an unsafe next step is possible under a formal encoding.
- What happens: encode the controller and a simplified one-step closed-loop model into a constraint solver; either extract a concrete counterexample or conclude no counterexample exists within the queried region (under the modeling assumptions).
- Why it matters: complements Stage 2’s empirical evidence with solver-backed local guarantees (or solver-found counterexamples).
Motivation
- Stage 2 is evidence-based: it can miss rare failures (sampling gaps) and it cannot distinguish “no counterexample found” from “provably none exist”.
- Stage 3 introduces a proof engine (SMT) to make existence statements inside a chosen region: either produce a counterexample (SAT) or rule out counterexamples (UNSAT) under an explicit model.
What is being verified
- Query type: local, bounded safety checks over a region in state space.
- Region: a “box” around a chosen center state (e.g., around states suggested by Stage 2 counterexamples).
- Property: existence of an initial state in the region such that the next step violates the safety bounds (a one-step safety violation query).
Why “one-step” and why “simplified model”
- Directly verifying the full CartPole dynamics and multi-step behavior end-to-end is much harder.
- Stage 3 uses a deliberately simplified one-step closed-loop model to:
- keep the solver problem tractable,
- demonstrate how to encode a neural controller + dynamics + safety predicate as constraints,
- provide quick feedback on whether a region is obviously unsafe under conservative assumptions.
- Interpretation: results are only as sound as the modeling assumptions (dynamics approximation, one-step horizon, region definition).
How to read solver outcomes
- SAT (unsafe exists): the solver returns a concrete state in the region (and an action choice consistent with the controller) that violates the one-step safety bounds under the model.
- This is a constructive counterexample for the modeled system.
- UNSAT (no unsafe exists in region): under the model and within the queried region, no one-step unsafe transition exists.
- This is a local guarantee, not a global or multi-step guarantee.
- UNKNOWN / timeout: the solver could not decide in time; treat as “inconclusive” rather than safe.
CE-guided variant (connecting Stage 2 → Stage 3)
- Instead of manually picking regions, Stage 3 can pick region centers from Stage 2’s sampled counterexamples.
- This makes the formal step targeted: you spend solver effort where the empirical method already suggests risk.
- The resulting SMT counterexamples can then be compared back to sampling results for consistency checks and debugging of modeling assumptions.
This track uses the Safety-Gymnasium environment SafetyPointGoal1-v0, trains a PPO controller, and treats the environment-provided cost (safety cost) as the safety signal throughout the pipeline.
Suggested extra dependencies (in addition to requirements.txt):
safety-gymnasiumstable-baselines3scikit-learn(local linear approximation, risk model fitting)- (optional)
alpha-beta-CROWN(Stage 3 NN bounding/certification experiments)
All artifacts for this track are written under outputs/safety_gym/.
- Goal: obtain a baseline policy and generate reusable rollouts and risk artifacts for later stages.
- What it does :
- Training: train PPO on
SafetyPointGoal1-v0so the policy learns to reach the goal / achieve reward. - Evaluation: on fresh episodes, log basic “task performance vs safety cost” metrics (return, total cost, episode length, collision/trigger counts, etc.).
- Data generation: roll out the fixed baseline to produce trajectories (time series of observations/actions/reward/cost), serving as the dataset for Stage 1–3.
- Risk artifacts: define local risk as “whether a cost event will occur within the next H steps” (bounded-horizon unsafe probability), and estimate it from rollouts:
- An interpretable 1D risk map: extract a hazard-proximity feature from
hazards_lidar, bin it, and estimate risk per bin. - A more flexible risk model: fit a model on compact features to predict unsafe probability, used later to select high-risk local regions.
- An interpretable 1D risk map: extract a hazard-proximity feature from
- Training: train PPO on
- Why it matters: it solidifies “empirical evidence (rollouts) → reusable risk description”, so Stage 1/2/3 can compare, improve, and verify against the same evidence.
The core idea of this stage is to turn “safety” (as expressed by the environment’s cost signal and sensor readings) into a small set of clear, measurable specifications (specs), then run runtime monitoring over full episode traces (finite-trace semantics) to produce comparable safety metrics and diagnostic visualizations.
Inputs
- Policy: the Stage 0 baseline (and, optionally, improved policies from counterexample-guided retraining, evaluated under the same monitoring protocol).
- Data: evaluation episodes in
SafetyPointGoal1-v0; each step provides reward, cost, and observations/sensors (e.g., goal/hazard/obstacle lidar, velocity, etc.).
What Stage 1 does (process/outcomes, not implementation details)
- Define specifications (specs): use a small temporal-logic vocabulary plus episode-level budgets to express “achieve the task while respecting safety/resource limits”.
- Typical temporal templates:
G(·): must hold at every step (always).F(·): must hold at least once (eventually).
- Two main families:
- Task-related: e.g., “eventually reach the goal” (
F(goal_reached)). - Safety/resource-related: e.g., “never incur safety cost” (
G(no_step_cost)), “episode total cost stays within a budget” (total_cost <= C_max), “energy/velocity budgets stay within bounds” (e.g.,total_energy <= E_max,G(speed_limit_ok)), and finer-grained envelopes like “never enter hazards / never collide with obstacles”.
- Task-related: e.g., “eventually reach the goal” (
- Typical temporal templates:
- Monitor each episode trace: treat each episode as a finite trace and check each spec along the trajectory, producing a satisfied/violated verdict per spec per episode.
- Aggregate into comparable metrics:
- Per episode: log task return, total cost, episode length, and spec satisfaction outcomes.
- Across episodes: compute satisfaction rates per spec to answer “how often does this policy meet the safety definition?”
- Produce diagnostic visualizations: put “performance vs safety” into the same view to understand trade-offs, e.g.:
- Bar charts of spec satisfaction rates (which constraints are hardest to satisfy).
- Return–cost scatter plots (colored by safe/unsafe or by violation type) to identify failure modes.
- Enable policy comparison (baseline vs improved policies): when using counterexample-guided retraining or other refinements, Stage 1 provides a consistent evaluation protocol (same specs + same plots) to compare safety improvements and any performance trade-offs.
Why Stage 1 matters
- It elevates “cost observed during training” into an explicit safety definition with reproducible evaluation metrics, instead of relying on vague averages of cost/return.
- It provides the baseline and debugging entry point for Stages 2/3: knowing which specs fail and in what kinds of episodes makes targeted improvement and local formal checks much more effective.
This stage builds a lightweight abstraction of the closed-loop system from data, so we can ask simple “model-checking-style” questions about how likely it is to reach hazardous situations within a bounded horizon—without relying on absolute (x, y) positions (which Safety-Gymnasium does not reliably expose in observations).
Inputs
- Raw rollouts from Stage 0 (full episode traces).
- A 1D, binned “local risk” estimate derived from the same rollouts, using a hazard-proximity feature from
hazards_lidar(bins over[0, 1]). - A risk threshold that defines which abstract bins are considered “hazardous”.
What Stage 2 does (process/outcomes, not implementation details)
- Choose a 1D abstraction feature: map each observation to a scalar “hazard proximity” score extracted from
hazards_lidar(high means close to hazards). - Discretize into bins: quantize the scalar feature into
Bbins, yielding an abstract statez ∈ {0, …, B−1}for every time step of every rollout. - Learn an abstract Markov chain (MC): from the sequence of bins along rollouts, count transitions
z_t → z_{t+1}and normalize them into a transition matrixP.- This yields a data-driven, coarse-grained dynamics model in “hazard-lidar space”.
- Label “hazard” abstract states: use the Stage 0 risk estimate per bin and mark bins whose estimated risk exceeds a threshold as “hazard bins” (optionally treat unvisited/unknown bins as hazardous).
- Answer bounded-horizon reachability queries: compute quantities of the form
P(F<=H hazard)for multiple horizonsH(e.g., 5/10/20/50), which summarize: “starting from the rollout start distribution, what is the probability of hitting a hazard bin within H abstract steps?” - Export an inspectable graph view: export a node table (bin intervals, visit counts, start probability, risk, hazard label), an edge table (observed transitions + probabilities), and a JSON summary for downstream inspection and reuse.
Outputs (under outputs/safety_gym/)
- A node table and edge table describing the learned abstraction graph.
- A summary JSON reporting visited bins, number of hazard bins, and
P(F<=H hazard)across horizons.
Why Stage 2 matters
- It turns per-step rollouts into an interpretable, reusable model of “how the policy moves in hazard proximity space”.
- It provides a bridge between empirical risk estimation and more formal reasoning: you get quantitative, bounded-horizon safety reachability statements that can guide where to focus later, more expensive checks.
Stage 3 focuses on local, assumption-explicit formal checks: rather than trying to prove global safety, it zooms into regions suggested by data (high predicted risk) and asks a solver to either (a) produce a concrete counterexample inside the region, or (b) certify the region safe for a one-step model.
Key idea: local region selection is data-driven
- A learned risk model (from Stage 0 artifacts) is used to score states seen in rollouts.
- The top-risk states (or risk intervals) define local neighborhoods (L∞ boxes) where formal effort is concentrated.
(3A) Local SMT counterexample search (safety_gym/stage3_local_smt.py)
- What is modeled: a local linear closed-loop approximation in a compact risk-feature space:
- pick a center
x0and box radiusδin the risk-feature space, - fit
x_next ≈ A x + cfrom rollout transitions that stay within the box.
- pick a center
- What is “unsafe”: “unsafe within the next
LOCAL_HORIZONsteps”, approximated by the learned logistic risk model. The SMT query checks whether the predicted unsafe probability of the next state exceeds a thresholdp0. - What the solver answers: a satisfiability query of the form:
∃ x in box(x0, δ) s.t. risk_model(x_next) >= p0withx_next = A x + c.- SAT: returns a witness
x*(a concrete counterexample candidate inside the region). - UNSAT: under the linear approximation + risk model, no such one-step “high-risk” successor exists in the queried box.
- UNKNOWN: inconclusive.
- SAT: returns a witness
- Why this is useful: it turns “high predicted risk” into a constructive, checkable counterexample candidate (or a local no-counterexample statement) under explicit modeling assumptions.
(3B) α,β-CROWN one-step local certification (safety_gym/stage3_ppo_crown_one_step.py)
- What is modeled: a one-step closed-loop model combining:
- a locally-fit open-loop linear dynamics
o_next ≈ A o + B a + c, and - the PPO policy’s deterministic action
a = μ(o)(the neural network). This yields a differentiable closed-loop mapg(o) = A o + B μ(o) + csuitable for neural-network bound propagation.
- a locally-fit open-loop linear dynamics
- What is “unsafe”: a conservative proxy predicate on the next observation, because the full learned risk model (with non-linear feature extraction like top-k means/max) is not encoded end-to-end. The proxy used here is:
mean(top-k(hazards_lidar(o_next))) >= threshold.
- What the certifier answers: using α,β-CROWN (via auto_LiRPA), compute sound bounds on
g(o)for alloin the input box, then bound the proxy predicate:- SAFE: the certified upper bound of
mean(top-k(hazards_lidar(o_next)))is below threshold (region is one-step safe under the model). - UNSAFE: the certified lower bound is above threshold (region is provably unsafe under the proxy).
- UNKNOWN: bounds straddle the threshold (inconclusive).
- SAFE: the certified upper bound of
- Why this is useful: it can give a true “for all states in this region” statement (within the one-step linear dynamics approximation and proxy predicate), complementing the SMT-style “there exists a counterexample” search.
Why Stage 3 matters
- It distinguishes “counterexample found”, “locally certified safe”, and “inconclusive”, and makes the assumptions explicit (local region + linearized dynamics + risk/proxy predicate), which is crucial for debugging and for targeted safety improvements.