safe-rl: Formal Methods for Safe RL

Features

Two end-to-end tracks: a minimal CartPole-v1 track (DQN + LTL monitoring + SMT) and a Safety-Gymnasium track (SafetyPointGoal1-v0, PPO + cost-based safety + local verification/certification).
Runtime monitoring on finite traces: express safety/goal properties as small temporal-logic templates (e.g., G(·), F(·)) plus episode-level budgets, and report satisfaction rates and violation diagnostics.
Data-driven risk artifacts: estimate bounded-horizon “unsafe within H steps” risk from rollouts (1D hazard-lidar bin risk and a lightweight learned risk model) for downstream selection and comparison.
Counterexample-guided improvement loop: use high-risk regions as “counterexamples” to shape retraining via extra penalties and re-evaluate under the same monitoring protocol.
Local formal checks with explicit assumptions: (i) Z3-based local SMT search for counterexamples under a local linear approximation, and (ii) α,β-CROWN one-step local certification over a PPO policy with a proxy unsafe predicate.

Repository Layout

safe-rl/
├── README.md                      
├── requirements.txt               
├── cartpole/                      
│   ├── stage0_simulation.py              # train / rollout controller
│   ├── stage0_plot_training.py    
│   ├── stage0_plot_trajectories.py
│   ├── stage1_monitoring.py              # trace-based spec monitoring
│   ├── stage2_sampling_verify.py         # sampling-based safety map
│   ├── stage2_compare_robustness.py      # robustness comparisons
│   ├── stage3_smt_verify.py              # local SMT checks
│   ├── stage3_smt_ce_guided.py           # CE-guided SMT checks
├── safety_gym/                    
│   ├── stage0_train.py                   # train PPO baseline
│   ├── stage0_rollout_and_risk.py        # rollouts + risk artifacts
│   ├── stage0_plot_train.py              
│   ├── stage1_monitoring.py              # runtime monitoring over episode traces
│   ├── stage1_analyze.py                 
│   ├── stage1_ce_retrain.py              # counterexample-guided retraining
│   ├── stage2_abstraction_mc_hazard_lidar.py # 1D abstraction + reachability queries
│   ├── stage3_local_smt.py               # local SMT search (risk-feature space)
│   ├── stage3_ppo_crown_one_step.py      # α,β-CROWN one-step local certification
│   ├── plot_stage3.py             
└── outputs/                       # generated outputs (plots, logs, models)

CartPole Track (Gymnasium)

Setting up

Install Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

This track uses:

gymnasium for CartPole-v1
torch for a small DQN controller (ReLU MLP)
z3-solver for SMT-based Stage 3 experiments

All outputs for this track are written under outputs/cartpole_stage0/.

Stage 0 — Training + data generation

Goal: learn a working controller and produce representative rollouts for analysis.
What happens: a policy is trained in the environment; then trajectories are collected to serve as the “dataset” for later stages.
Why it matters: later stages treat the learned controller and its rollouts as the object of safety analysis.

Stage 1 — Runtime monitoring (spec satisfaction on traces)

Goal: evaluate a clear safety specification on recorded trajectories.
What happens: define a safety property (e.g., “always stay within position/angle bounds”) and compute how often it is satisfied, plus when/where violations occur.
Why it matters: turns “it seems safe” into measurable spec satisfaction metrics.

LTL Monitor and Specification

This track uses a deliberately small LTL fragment to express safety properties and monitor them on finite RL rollouts.

Specification (LTL fragment)

Formula shape: only top-level G ψ (“always”) or F ψ (“eventually”) is supported, where ψ contains only boolean structure over atomic propositions.
Atomic proposition: a Python predicate p(state) -> bool evaluated on each observed state.
In CartPole, the state is a 4D vector [x, x_dot, theta, theta_dot]. A standard safety atom is: |x| <= x_max ∧ |theta| <= theta_max.
Implementation references: cartpole/common/specs.py:1.

Monitor (finite-trace semantics)

Given a finite trajectory s_0, …, s_{T-1}:
- G ψ is satisfied iff ψ(s_t) holds for every t.
- F ψ is satisfied iff ψ(s_t) holds for at least one t.
Implementation idea (streaming / single pass):
- Evaluate ψ(s_t) at each step (eval_state evaluates the non-temporal part).
- Maintain a running boolean:
  - G ψ: start true; flip to false on the first violation.
  - F ψ: start false; flip to true on the first satisfaction.
Implementation references: cartpole/common/monitors.py:1.

What stage1 does

Motivation

Environment termination is not the same thing as a safety specification: it is a task-specific stopping rule and often too permissive (or simply different) from the safety envelope you care about.
Stage 1 makes the safety requirement explicit and measurable by checking a specification on trajectories produced by the learned controller.
It also lets you tighten/loosen the safety margin (e.g., choose stricter bounds than the environment’s default thresholds) and observe how the satisfaction rate changes.

What it computes

Inputs: trajectories collected in Stage 0 (a list of rollouts; each rollout stores a time series of states plus actions/rewards/dones).
Specification: build the safety predicate with chosen (x_max, theta_max) and form φ_safe = G(p_safe).
For each trajectory:
- Run the LTL monitor over the full state sequence to get a final verdict (satisfied/violated).
- Record the first time step that violates the safety predicate (a simple “time-to-violation” statistic).
Aggregate over trajectories:
- safe ratio (#satisfying / #total), number of unsafe episodes, and simple stats of the first-violation step.

Extra sanity-check view

Stage 1 also reports a “no-monitor sweep” under the same thresholds: episode-level and step-level violation rates. This gives a second perspective on how strict the chosen safety bounds are (and helps catch edge cases where the spec is too tight/too loose).

Stage 2 — Sampling-based approximate verification (state-space picture)

Goal: map out which initial conditions tend to violate the safety spec.
What happens: systematically sample a grid of initial states, simulate the closed loop for a fixed horizon, and label each initial condition as safe/unsafe under the chosen spec.
Why it matters: produces an interpretable “unsafe region” approximation and a set of representative counterexamples.

Stage 2 details: what is being approximated and why

Motivation

Stage 1 tells you “how often the spec is violated on the trajectories you happened to collect”.
Stage 2 tries to answer a different question: “from which states does the closed loop tend to become unsafe?”, i.e., give a geometric picture in state space.
This is useful for:
- identifying vulnerable regions (where safety is brittle),
- extracting counterexamples to guide later formal checks (Stage 3),
- comparing controllers beyond reward (e.g., NN vs a hand-designed baseline).

Problem being approximated

Ideal (hard) goal: characterize the set of initial states from which the closed-loop system will ever violate the safety spec.
Stage 2 uses a pragmatic approximation:
- restrict to a low-dimensional slice of the state space (e.g., vary (x, theta) and fix velocities),
- check violations over a finite horizon (bounded-time rather than “forever”),
- label states by simulation outcomes (sampling-based rather than proof-based).

What gets labeled as “unsafe”

Each sampled initial condition is rolled out under the learned controller for a fixed number of steps (or until termination).
If the safety predicate is violated at any time during this rollout window, the initial condition is labeled unsafe; otherwise safe (for the chosen horizon and thresholds).
This creates an “unsafe mask” over the sampled grid, which can be visualized as a heatmap/contour.

Counterexamples (why we store them)

A counterexample here is simply a sampled initial condition that triggers a violation (plus basic context such as the first violation time).
These counterexamples act as “seeds” for Stage 3: rather than guessing where to formally verify, you verify local regions around empirically bad states.

Robustness / controller comparison (optional extension)

Beyond a binary safe/unsafe map, Stage 2 can summarize how close states are to the safety boundary (a normalized margin) and plot “margin vs violation probability” curves.
The same procedure can be run for multiple controllers (e.g., the learned NN controller vs a heuristic baseline) to compare safety robustness across the state space, not just on a single set of rollouts.

Stage 3 — SMT-based local checks (proof/CE on simplified model)

Goal: on a local region of the state space, decide whether an unsafe next step is possible under a formal encoding.
What happens: encode the controller and a simplified one-step closed-loop model into a constraint solver; either extract a concrete counterexample or conclude no counterexample exists within the queried region (under the modeling assumptions).
Why it matters: complements Stage 2’s empirical evidence with solver-backed local guarantees (or solver-found counterexamples).

Stage 3 details: what the solver proves (and what it does not)

Motivation

Stage 2 is evidence-based: it can miss rare failures (sampling gaps) and it cannot distinguish “no counterexample found” from “provably none exist”.
Stage 3 introduces a proof engine (SMT) to make existence statements inside a chosen region: either produce a counterexample (SAT) or rule out counterexamples (UNSAT) under an explicit model.

What is being verified

Query type: local, bounded safety checks over a region in state space.
Region: a “box” around a chosen center state (e.g., around states suggested by Stage 2 counterexamples).
Property: existence of an initial state in the region such that the next step violates the safety bounds (a one-step safety violation query).

Why “one-step” and why “simplified model”

Directly verifying the full CartPole dynamics and multi-step behavior end-to-end is much harder.
Stage 3 uses a deliberately simplified one-step closed-loop model to:
- keep the solver problem tractable,
- demonstrate how to encode a neural controller + dynamics + safety predicate as constraints,
- provide quick feedback on whether a region is obviously unsafe under conservative assumptions.
Interpretation: results are only as sound as the modeling assumptions (dynamics approximation, one-step horizon, region definition).

How to read solver outcomes

SAT (unsafe exists): the solver returns a concrete state in the region (and an action choice consistent with the controller) that violates the one-step safety bounds under the model.
- This is a constructive counterexample for the modeled system.
UNSAT (no unsafe exists in region): under the model and within the queried region, no one-step unsafe transition exists.
- This is a local guarantee, not a global or multi-step guarantee.
UNKNOWN / timeout: the solver could not decide in time; treat as “inconclusive” rather than safe.

CE-guided variant (connecting Stage 2 → Stage 3)

Instead of manually picking regions, Stage 3 can pick region centers from Stage 2’s sampled counterexamples.
This makes the formal step targeted: you spend solver effort where the empirical method already suggests risk.
The resulting SMT counterexamples can then be compared back to sampling results for consistency checks and debugging of modeling assumptions.

Safety Gymnasium Track (SafetyPointGoal1-v0)

Setting up

This track uses the Safety-Gymnasium environment SafetyPointGoal1-v0, trains a PPO controller, and treats the environment-provided cost (safety cost) as the safety signal throughout the pipeline.

Suggested extra dependencies (in addition to requirements.txt):

safety-gymnasium
stable-baselines3
scikit-learn (local linear approximation, risk model fitting)
(optional) alpha-beta-CROWN (Stage 3 NN bounding/certification experiments)

All artifacts for this track are written under outputs/safety_gym/.

Stage 0 — Train a baseline + generate downstream artifacts

Goal: obtain a baseline policy and generate reusable rollouts and risk artifacts for later stages.
What it does :
- Training: train PPO on SafetyPointGoal1-v0 so the policy learns to reach the goal / achieve reward.
- Evaluation: on fresh episodes, log basic “task performance vs safety cost” metrics (return, total cost, episode length, collision/trigger counts, etc.).
- Data generation: roll out the fixed baseline to produce trajectories (time series of observations/actions/reward/cost), serving as the dataset for Stage 1–3.
- Risk artifacts: define local risk as “whether a cost event will occur within the next H steps” (bounded-horizon unsafe probability), and estimate it from rollouts:
  - An interpretable 1D risk map: extract a hazard-proximity feature from hazards_lidar, bin it, and estimate risk per bin.
  - A more flexible risk model: fit a model on compact features to predict unsafe probability, used later to select high-risk local regions.
Why it matters: it solidifies “empirical evidence (rollouts) → reusable risk description”, so Stage 1/2/3 can compare, improve, and verify against the same evidence.

Stage 1

The core idea of this stage is to turn “safety” (as expressed by the environment’s cost signal and sensor readings) into a small set of clear, measurable specifications (specs), then run runtime monitoring over full episode traces (finite-trace semantics) to produce comparable safety metrics and diagnostic visualizations.

Inputs

Policy: the Stage 0 baseline (and, optionally, improved policies from counterexample-guided retraining, evaluated under the same monitoring protocol).
Data: evaluation episodes in SafetyPointGoal1-v0; each step provides reward, cost, and observations/sensors (e.g., goal/hazard/obstacle lidar, velocity, etc.).

What Stage 1 does (process/outcomes, not implementation details)

Define specifications (specs): use a small temporal-logic vocabulary plus episode-level budgets to express “achieve the task while respecting safety/resource limits”.
- Typical temporal templates:
  - G(·): must hold at every step (always).
  - F(·): must hold at least once (eventually).
- Two main families:
  - Task-related: e.g., “eventually reach the goal” (F(goal_reached)).
  - Safety/resource-related: e.g., “never incur safety cost” (G(no_step_cost)), “episode total cost stays within a budget” (total_cost <= C_max), “energy/velocity budgets stay within bounds” (e.g., total_energy <= E_max, G(speed_limit_ok)), and finer-grained envelopes like “never enter hazards / never collide with obstacles”.
Monitor each episode trace: treat each episode as a finite trace and check each spec along the trajectory, producing a satisfied/violated verdict per spec per episode.
Aggregate into comparable metrics:
- Per episode: log task return, total cost, episode length, and spec satisfaction outcomes.
- Across episodes: compute satisfaction rates per spec to answer “how often does this policy meet the safety definition?”
Produce diagnostic visualizations: put “performance vs safety” into the same view to understand trade-offs, e.g.:
- Bar charts of spec satisfaction rates (which constraints are hardest to satisfy).
- Return–cost scatter plots (colored by safe/unsafe or by violation type) to identify failure modes.
Enable policy comparison (baseline vs improved policies): when using counterexample-guided retraining or other refinements, Stage 1 provides a consistent evaluation protocol (same specs + same plots) to compare safety improvements and any performance trade-offs.

Why Stage 1 matters

It elevates “cost observed during training” into an explicit safety definition with reproducible evaluation metrics, instead of relying on vague averages of cost/return.
It provides the baseline and debugging entry point for Stages 2/3: knowing which specs fail and in what kinds of episodes makes targeted improvement and local formal checks much more effective.

Stage 2

This stage builds a lightweight abstraction of the closed-loop system from data, so we can ask simple “model-checking-style” questions about how likely it is to reach hazardous situations within a bounded horizon—without relying on absolute (x, y) positions (which Safety-Gymnasium does not reliably expose in observations).

Inputs

Raw rollouts from Stage 0 (full episode traces).
A 1D, binned “local risk” estimate derived from the same rollouts, using a hazard-proximity feature from hazards_lidar (bins over [0, 1]).
A risk threshold that defines which abstract bins are considered “hazardous”.

What Stage 2 does (process/outcomes, not implementation details)

Choose a 1D abstraction feature: map each observation to a scalar “hazard proximity” score extracted from hazards_lidar (high means close to hazards).
Discretize into bins: quantize the scalar feature into B bins, yielding an abstract state z ∈ {0, …, B−1} for every time step of every rollout.
Learn an abstract Markov chain (MC): from the sequence of bins along rollouts, count transitions z_t → z_{t+1} and normalize them into a transition matrix P.
- This yields a data-driven, coarse-grained dynamics model in “hazard-lidar space”.
Label “hazard” abstract states: use the Stage 0 risk estimate per bin and mark bins whose estimated risk exceeds a threshold as “hazard bins” (optionally treat unvisited/unknown bins as hazardous).
Answer bounded-horizon reachability queries: compute quantities of the form P(F<=H hazard) for multiple horizons H (e.g., 5/10/20/50), which summarize: “starting from the rollout start distribution, what is the probability of hitting a hazard bin within H abstract steps?”
Export an inspectable graph view: export a node table (bin intervals, visit counts, start probability, risk, hazard label), an edge table (observed transitions + probabilities), and a JSON summary for downstream inspection and reuse.

Outputs (under outputs/safety_gym/)

A node table and edge table describing the learned abstraction graph.
A summary JSON reporting visited bins, number of hazard bins, and P(F<=H hazard) across horizons.

Why Stage 2 matters

It turns per-step rollouts into an interpretable, reusable model of “how the policy moves in hazard proximity space”.
It provides a bridge between empirical risk estimation and more formal reasoning: you get quantitative, bounded-horizon safety reachability statements that can guide where to focus later, more expensive checks.

Stage 3 — Local formal checks (SMT / certification views)

Stage 3 focuses on local, assumption-explicit formal checks: rather than trying to prove global safety, it zooms into regions suggested by data (high predicted risk) and asks a solver to either (a) produce a concrete counterexample inside the region, or (b) certify the region safe for a one-step model.

Key idea: local region selection is data-driven

A learned risk model (from Stage 0 artifacts) is used to score states seen in rollouts.
The top-risk states (or risk intervals) define local neighborhoods (L∞ boxes) where formal effort is concentrated.

(3A) Local SMT counterexample search (safety_gym/stage3_local_smt.py)

What is modeled: a local linear closed-loop approximation in a compact risk-feature space:
- pick a center x0 and box radius δ in the risk-feature space,
- fit x_next ≈ A x + c from rollout transitions that stay within the box.
What is “unsafe”: “unsafe within the next LOCAL_HORIZON steps”, approximated by the learned logistic risk model. The SMT query checks whether the predicted unsafe probability of the next state exceeds a threshold p0.
What the solver answers: a satisfiability query of the form:
∃ x in box(x0, δ) s.t. risk_model(x_next) >= p0 with x_next = A x + c.
- SAT: returns a witness x* (a concrete counterexample candidate inside the region).
- UNSAT: under the linear approximation + risk model, no such one-step “high-risk” successor exists in the queried box.
- UNKNOWN: inconclusive.
Why this is useful: it turns “high predicted risk” into a constructive, checkable counterexample candidate (or a local no-counterexample statement) under explicit modeling assumptions.

(3B) α,β-CROWN one-step local certification (safety_gym/stage3_ppo_crown_one_step.py)

What is modeled: a one-step closed-loop model combining:
- a locally-fit open-loop linear dynamics o_next ≈ A o + B a + c, and
- the PPO policy’s deterministic action a = μ(o) (the neural network). This yields a differentiable closed-loop map g(o) = A o + B μ(o) + c suitable for neural-network bound propagation.
What is “unsafe”: a conservative proxy predicate on the next observation, because the full learned risk model (with non-linear feature extraction like top-k means/max) is not encoded end-to-end. The proxy used here is:
- mean(top-k(hazards_lidar(o_next))) >= threshold.
What the certifier answers: using α,β-CROWN (via auto_LiRPA), compute sound bounds on g(o) for all o in the input box, then bound the proxy predicate:
- SAFE: the certified upper bound of mean(top-k(hazards_lidar(o_next))) is below threshold (region is one-step safe under the model).
- UNSAFE: the certified lower bound is above threshold (region is provably unsafe under the proxy).
- UNKNOWN: bounds straddle the threshold (inconclusive).
Why this is useful: it can give a true “for all states in this region” statement (within the one-step linear dynamics approximation and proxy predicate), complementing the SMT-style “there exists a counterexample” search.

Why Stage 3 matters

It distinguishes “counterexample found”, “locally certified safe”, and “inconclusive”, and makes the assumptions explicit (local region + linearized dynamics + risk/proxy predicate), which is crucial for debugging and for targeted safety improvements.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

safe-rl: Formal Methods for Safe RL

Features

Repository Layout

CartPole Track (Gymnasium)

Setting up

Stage 0 — Training + data generation

Stage 1 — Runtime monitoring (spec satisfaction on traces)

LTL Monitor and Specification

What stage1 does

Stage 2 — Sampling-based approximate verification (state-space picture)

Stage 2 details: what is being approximated and why

Stage 3 — SMT-based local checks (proof/CE on simplified model)

Stage 3 details: what the solver proves (and what it does not)

Safety Gymnasium Track (SafetyPointGoal1-v0)

Setting up

Stage 0 — Train a baseline + generate downstream artifacts

Stage 1

Stage 2

Stage 3 — Local formal checks (SMT / certification views)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
cartpole		cartpole
safety_gym		safety_gym
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

safe-rl: Formal Methods for Safe RL

Features

Repository Layout

CartPole Track (Gymnasium)

Setting up

Stage 0 — Training + data generation

Stage 1 — Runtime monitoring (spec satisfaction on traces)

LTL Monitor and Specification

What stage1 does

Stage 2 — Sampling-based approximate verification (state-space picture)

Stage 2 details: what is being approximated and why

Stage 3 — SMT-based local checks (proof/CE on simplified model)

Stage 3 details: what the solver proves (and what it does not)

Safety Gymnasium Track (SafetyPointGoal1-v0)

Setting up

Stage 0 — Train a baseline + generate downstream artifacts

Stage 1

Stage 2

Stage 3 — Local formal checks (SMT / certification views)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages