Skip to content

Ewendawi/safe-RL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

safe-rl: Formal Methods for Safe RL

Features

  • Two end-to-end tracks: a minimal CartPole-v1 track (DQN + LTL monitoring + SMT) and a Safety-Gymnasium track (SafetyPointGoal1-v0, PPO + cost-based safety + local verification/certification).
  • Runtime monitoring on finite traces: express safety/goal properties as small temporal-logic templates (e.g., G(·), F(·)) plus episode-level budgets, and report satisfaction rates and violation diagnostics.
  • Data-driven risk artifacts: estimate bounded-horizon “unsafe within H steps” risk from rollouts (1D hazard-lidar bin risk and a lightweight learned risk model) for downstream selection and comparison.
  • Counterexample-guided improvement loop: use high-risk regions as “counterexamples” to shape retraining via extra penalties and re-evaluate under the same monitoring protocol.
  • Local formal checks with explicit assumptions: (i) Z3-based local SMT search for counterexamples under a local linear approximation, and (ii) α,β-CROWN one-step local certification over a PPO policy with a proxy unsafe predicate.

Repository Layout

safe-rl/
├── README.md                      
├── requirements.txt               
├── cartpole/                      
│   ├── stage0_simulation.py              # train / rollout controller
│   ├── stage0_plot_training.py    
│   ├── stage0_plot_trajectories.py
│   ├── stage1_monitoring.py              # trace-based spec monitoring
│   ├── stage2_sampling_verify.py         # sampling-based safety map
│   ├── stage2_compare_robustness.py      # robustness comparisons
│   ├── stage3_smt_verify.py              # local SMT checks
│   ├── stage3_smt_ce_guided.py           # CE-guided SMT checks
├── safety_gym/                    
│   ├── stage0_train.py                   # train PPO baseline
│   ├── stage0_rollout_and_risk.py        # rollouts + risk artifacts
│   ├── stage0_plot_train.py              
│   ├── stage1_monitoring.py              # runtime monitoring over episode traces
│   ├── stage1_analyze.py                 
│   ├── stage1_ce_retrain.py              # counterexample-guided retraining
│   ├── stage2_abstraction_mc_hazard_lidar.py # 1D abstraction + reachability queries
│   ├── stage3_local_smt.py               # local SMT search (risk-feature space)
│   ├── stage3_ppo_crown_one_step.py      # α,β-CROWN one-step local certification
│   ├── plot_stage3.py             
└── outputs/                       # generated outputs (plots, logs, models)

CartPole Track (Gymnasium)

Setting up

Install Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

This track uses:

  • gymnasium for CartPole-v1
  • torch for a small DQN controller (ReLU MLP)
  • z3-solver for SMT-based Stage 3 experiments

All outputs for this track are written under outputs/cartpole_stage0/.

Stage 0 — Training + data generation

  • Goal: learn a working controller and produce representative rollouts for analysis.
  • What happens: a policy is trained in the environment; then trajectories are collected to serve as the “dataset” for later stages.
  • Why it matters: later stages treat the learned controller and its rollouts as the object of safety analysis.

Stage 1 — Runtime monitoring (spec satisfaction on traces)

  • Goal: evaluate a clear safety specification on recorded trajectories.
  • What happens: define a safety property (e.g., “always stay within position/angle bounds”) and compute how often it is satisfied, plus when/where violations occur.
  • Why it matters: turns “it seems safe” into measurable spec satisfaction metrics.

LTL Monitor and Specification

This track uses a deliberately small LTL fragment to express safety properties and monitor them on finite RL rollouts.

Specification (LTL fragment)

  • Formula shape: only top-level G ψ (“always”) or F ψ (“eventually”) is supported, where ψ contains only boolean structure over atomic propositions.
  • Atomic proposition: a Python predicate p(state) -> bool evaluated on each observed state.
  • In CartPole, the state is a 4D vector [x, x_dot, theta, theta_dot]. A standard safety atom is: |x| <= x_max ∧ |theta| <= theta_max.
  • Implementation references: cartpole/common/specs.py:1.

Monitor (finite-trace semantics)

  • Given a finite trajectory s_0, …, s_{T-1}:
    • G ψ is satisfied iff ψ(s_t) holds for every t.
    • F ψ is satisfied iff ψ(s_t) holds for at least one t.
  • Implementation idea (streaming / single pass):
    • Evaluate ψ(s_t) at each step (eval_state evaluates the non-temporal part).
    • Maintain a running boolean:
      • G ψ: start true; flip to false on the first violation.
      • F ψ: start false; flip to true on the first satisfaction.
  • Implementation references: cartpole/common/monitors.py:1.

What stage1 does

Motivation

  • Environment termination is not the same thing as a safety specification: it is a task-specific stopping rule and often too permissive (or simply different) from the safety envelope you care about.
  • Stage 1 makes the safety requirement explicit and measurable by checking a specification on trajectories produced by the learned controller.
  • It also lets you tighten/loosen the safety margin (e.g., choose stricter bounds than the environment’s default thresholds) and observe how the satisfaction rate changes.

What it computes

  • Inputs: trajectories collected in Stage 0 (a list of rollouts; each rollout stores a time series of states plus actions/rewards/dones).
  • Specification: build the safety predicate with chosen (x_max, theta_max) and form φ_safe = G(p_safe).
  • For each trajectory:
    • Run the LTL monitor over the full state sequence to get a final verdict (satisfied/violated).
    • Record the first time step that violates the safety predicate (a simple “time-to-violation” statistic).
  • Aggregate over trajectories:
    • safe ratio (#satisfying / #total), number of unsafe episodes, and simple stats of the first-violation step.

Extra sanity-check view

  • Stage 1 also reports a “no-monitor sweep” under the same thresholds: episode-level and step-level violation rates. This gives a second perspective on how strict the chosen safety bounds are (and helps catch edge cases where the spec is too tight/too loose).

Stage 2 — Sampling-based approximate verification (state-space picture)

  • Goal: map out which initial conditions tend to violate the safety spec.
  • What happens: systematically sample a grid of initial states, simulate the closed loop for a fixed horizon, and label each initial condition as safe/unsafe under the chosen spec.
  • Why it matters: produces an interpretable “unsafe region” approximation and a set of representative counterexamples.

Stage 2 details: what is being approximated and why

Motivation

  • Stage 1 tells you “how often the spec is violated on the trajectories you happened to collect”.
  • Stage 2 tries to answer a different question: “from which states does the closed loop tend to become unsafe?”, i.e., give a geometric picture in state space.
  • This is useful for:
    • identifying vulnerable regions (where safety is brittle),
    • extracting counterexamples to guide later formal checks (Stage 3),
    • comparing controllers beyond reward (e.g., NN vs a hand-designed baseline).

Problem being approximated

  • Ideal (hard) goal: characterize the set of initial states from which the closed-loop system will ever violate the safety spec.
  • Stage 2 uses a pragmatic approximation:
    • restrict to a low-dimensional slice of the state space (e.g., vary (x, theta) and fix velocities),
    • check violations over a finite horizon (bounded-time rather than “forever”),
    • label states by simulation outcomes (sampling-based rather than proof-based).

What gets labeled as “unsafe”

  • Each sampled initial condition is rolled out under the learned controller for a fixed number of steps (or until termination).
  • If the safety predicate is violated at any time during this rollout window, the initial condition is labeled unsafe; otherwise safe (for the chosen horizon and thresholds).
  • This creates an “unsafe mask” over the sampled grid, which can be visualized as a heatmap/contour.

Counterexamples (why we store them)

  • A counterexample here is simply a sampled initial condition that triggers a violation (plus basic context such as the first violation time).
  • These counterexamples act as “seeds” for Stage 3: rather than guessing where to formally verify, you verify local regions around empirically bad states.

Robustness / controller comparison (optional extension)

  • Beyond a binary safe/unsafe map, Stage 2 can summarize how close states are to the safety boundary (a normalized margin) and plot “margin vs violation probability” curves.
  • The same procedure can be run for multiple controllers (e.g., the learned NN controller vs a heuristic baseline) to compare safety robustness across the state space, not just on a single set of rollouts.

Stage 3 — SMT-based local checks (proof/CE on simplified model)

  • Goal: on a local region of the state space, decide whether an unsafe next step is possible under a formal encoding.
  • What happens: encode the controller and a simplified one-step closed-loop model into a constraint solver; either extract a concrete counterexample or conclude no counterexample exists within the queried region (under the modeling assumptions).
  • Why it matters: complements Stage 2’s empirical evidence with solver-backed local guarantees (or solver-found counterexamples).

Stage 3 details: what the solver proves (and what it does not)

Motivation

  • Stage 2 is evidence-based: it can miss rare failures (sampling gaps) and it cannot distinguish “no counterexample found” from “provably none exist”.
  • Stage 3 introduces a proof engine (SMT) to make existence statements inside a chosen region: either produce a counterexample (SAT) or rule out counterexamples (UNSAT) under an explicit model.

What is being verified

  • Query type: local, bounded safety checks over a region in state space.
  • Region: a “box” around a chosen center state (e.g., around states suggested by Stage 2 counterexamples).
  • Property: existence of an initial state in the region such that the next step violates the safety bounds (a one-step safety violation query).

Why “one-step” and why “simplified model”

  • Directly verifying the full CartPole dynamics and multi-step behavior end-to-end is much harder.
  • Stage 3 uses a deliberately simplified one-step closed-loop model to:
    • keep the solver problem tractable,
    • demonstrate how to encode a neural controller + dynamics + safety predicate as constraints,
    • provide quick feedback on whether a region is obviously unsafe under conservative assumptions.
  • Interpretation: results are only as sound as the modeling assumptions (dynamics approximation, one-step horizon, region definition).

How to read solver outcomes

  • SAT (unsafe exists): the solver returns a concrete state in the region (and an action choice consistent with the controller) that violates the one-step safety bounds under the model.
    • This is a constructive counterexample for the modeled system.
  • UNSAT (no unsafe exists in region): under the model and within the queried region, no one-step unsafe transition exists.
    • This is a local guarantee, not a global or multi-step guarantee.
  • UNKNOWN / timeout: the solver could not decide in time; treat as “inconclusive” rather than safe.

CE-guided variant (connecting Stage 2 → Stage 3)

  • Instead of manually picking regions, Stage 3 can pick region centers from Stage 2’s sampled counterexamples.
  • This makes the formal step targeted: you spend solver effort where the empirical method already suggests risk.
  • The resulting SMT counterexamples can then be compared back to sampling results for consistency checks and debugging of modeling assumptions.

Safety Gymnasium Track (SafetyPointGoal1-v0)

Setting up

This track uses the Safety-Gymnasium environment SafetyPointGoal1-v0, trains a PPO controller, and treats the environment-provided cost (safety cost) as the safety signal throughout the pipeline.

Suggested extra dependencies (in addition to requirements.txt):

  • safety-gymnasium
  • stable-baselines3
  • scikit-learn (local linear approximation, risk model fitting)
  • (optional) alpha-beta-CROWN (Stage 3 NN bounding/certification experiments)

All artifacts for this track are written under outputs/safety_gym/.

Stage 0 — Train a baseline + generate downstream artifacts

  • Goal: obtain a baseline policy and generate reusable rollouts and risk artifacts for later stages.
  • What it does :
    • Training: train PPO on SafetyPointGoal1-v0 so the policy learns to reach the goal / achieve reward.
    • Evaluation: on fresh episodes, log basic “task performance vs safety cost” metrics (return, total cost, episode length, collision/trigger counts, etc.).
    • Data generation: roll out the fixed baseline to produce trajectories (time series of observations/actions/reward/cost), serving as the dataset for Stage 1–3.
    • Risk artifacts: define local risk as “whether a cost event will occur within the next H steps” (bounded-horizon unsafe probability), and estimate it from rollouts:
      • An interpretable 1D risk map: extract a hazard-proximity feature from hazards_lidar, bin it, and estimate risk per bin.
      • A more flexible risk model: fit a model on compact features to predict unsafe probability, used later to select high-risk local regions.
  • Why it matters: it solidifies “empirical evidence (rollouts) → reusable risk description”, so Stage 1/2/3 can compare, improve, and verify against the same evidence.

Stage 1

The core idea of this stage is to turn “safety” (as expressed by the environment’s cost signal and sensor readings) into a small set of clear, measurable specifications (specs), then run runtime monitoring over full episode traces (finite-trace semantics) to produce comparable safety metrics and diagnostic visualizations.

Inputs

  • Policy: the Stage 0 baseline (and, optionally, improved policies from counterexample-guided retraining, evaluated under the same monitoring protocol).
  • Data: evaluation episodes in SafetyPointGoal1-v0; each step provides reward, cost, and observations/sensors (e.g., goal/hazard/obstacle lidar, velocity, etc.).

What Stage 1 does (process/outcomes, not implementation details)

  1. Define specifications (specs): use a small temporal-logic vocabulary plus episode-level budgets to express “achieve the task while respecting safety/resource limits”.
    • Typical temporal templates:
      • G(·): must hold at every step (always).
      • F(·): must hold at least once (eventually).
    • Two main families:
      • Task-related: e.g., “eventually reach the goal” (F(goal_reached)).
      • Safety/resource-related: e.g., “never incur safety cost” (G(no_step_cost)), “episode total cost stays within a budget” (total_cost <= C_max), “energy/velocity budgets stay within bounds” (e.g., total_energy <= E_max, G(speed_limit_ok)), and finer-grained envelopes like “never enter hazards / never collide with obstacles”.
  2. Monitor each episode trace: treat each episode as a finite trace and check each spec along the trajectory, producing a satisfied/violated verdict per spec per episode.
  3. Aggregate into comparable metrics:
    • Per episode: log task return, total cost, episode length, and spec satisfaction outcomes.
    • Across episodes: compute satisfaction rates per spec to answer “how often does this policy meet the safety definition?”
  4. Produce diagnostic visualizations: put “performance vs safety” into the same view to understand trade-offs, e.g.:
    • Bar charts of spec satisfaction rates (which constraints are hardest to satisfy).
    • Return–cost scatter plots (colored by safe/unsafe or by violation type) to identify failure modes.
  5. Enable policy comparison (baseline vs improved policies): when using counterexample-guided retraining or other refinements, Stage 1 provides a consistent evaluation protocol (same specs + same plots) to compare safety improvements and any performance trade-offs.

Why Stage 1 matters

  • It elevates “cost observed during training” into an explicit safety definition with reproducible evaluation metrics, instead of relying on vague averages of cost/return.
  • It provides the baseline and debugging entry point for Stages 2/3: knowing which specs fail and in what kinds of episodes makes targeted improvement and local formal checks much more effective.

Stage 2

This stage builds a lightweight abstraction of the closed-loop system from data, so we can ask simple “model-checking-style” questions about how likely it is to reach hazardous situations within a bounded horizon—without relying on absolute (x, y) positions (which Safety-Gymnasium does not reliably expose in observations).

Inputs

  • Raw rollouts from Stage 0 (full episode traces).
  • A 1D, binned “local risk” estimate derived from the same rollouts, using a hazard-proximity feature from hazards_lidar (bins over [0, 1]).
  • A risk threshold that defines which abstract bins are considered “hazardous”.

What Stage 2 does (process/outcomes, not implementation details)

  1. Choose a 1D abstraction feature: map each observation to a scalar “hazard proximity” score extracted from hazards_lidar (high means close to hazards).
  2. Discretize into bins: quantize the scalar feature into B bins, yielding an abstract state z ∈ {0, …, B−1} for every time step of every rollout.
  3. Learn an abstract Markov chain (MC): from the sequence of bins along rollouts, count transitions z_t → z_{t+1} and normalize them into a transition matrix P.
    • This yields a data-driven, coarse-grained dynamics model in “hazard-lidar space”.
  4. Label “hazard” abstract states: use the Stage 0 risk estimate per bin and mark bins whose estimated risk exceeds a threshold as “hazard bins” (optionally treat unvisited/unknown bins as hazardous).
  5. Answer bounded-horizon reachability queries: compute quantities of the form P(F<=H hazard) for multiple horizons H (e.g., 5/10/20/50), which summarize: “starting from the rollout start distribution, what is the probability of hitting a hazard bin within H abstract steps?”
  6. Export an inspectable graph view: export a node table (bin intervals, visit counts, start probability, risk, hazard label), an edge table (observed transitions + probabilities), and a JSON summary for downstream inspection and reuse.

Outputs (under outputs/safety_gym/)

  • A node table and edge table describing the learned abstraction graph.
  • A summary JSON reporting visited bins, number of hazard bins, and P(F<=H hazard) across horizons.

Why Stage 2 matters

  • It turns per-step rollouts into an interpretable, reusable model of “how the policy moves in hazard proximity space”.
  • It provides a bridge between empirical risk estimation and more formal reasoning: you get quantitative, bounded-horizon safety reachability statements that can guide where to focus later, more expensive checks.

Stage 3 — Local formal checks (SMT / certification views)

Stage 3 focuses on local, assumption-explicit formal checks: rather than trying to prove global safety, it zooms into regions suggested by data (high predicted risk) and asks a solver to either (a) produce a concrete counterexample inside the region, or (b) certify the region safe for a one-step model.

Key idea: local region selection is data-driven

  • A learned risk model (from Stage 0 artifacts) is used to score states seen in rollouts.
  • The top-risk states (or risk intervals) define local neighborhoods (L∞ boxes) where formal effort is concentrated.

(3A) Local SMT counterexample search (safety_gym/stage3_local_smt.py)

  • What is modeled: a local linear closed-loop approximation in a compact risk-feature space:
    • pick a center x0 and box radius δ in the risk-feature space,
    • fit x_next ≈ A x + c from rollout transitions that stay within the box.
  • What is “unsafe”: “unsafe within the next LOCAL_HORIZON steps”, approximated by the learned logistic risk model. The SMT query checks whether the predicted unsafe probability of the next state exceeds a threshold p0.
  • What the solver answers: a satisfiability query of the form:
    ∃ x in box(x0, δ) s.t. risk_model(x_next) >= p0 with x_next = A x + c.
    • SAT: returns a witness x* (a concrete counterexample candidate inside the region).
    • UNSAT: under the linear approximation + risk model, no such one-step “high-risk” successor exists in the queried box.
    • UNKNOWN: inconclusive.
  • Why this is useful: it turns “high predicted risk” into a constructive, checkable counterexample candidate (or a local no-counterexample statement) under explicit modeling assumptions.

(3B) α,β-CROWN one-step local certification (safety_gym/stage3_ppo_crown_one_step.py)

  • What is modeled: a one-step closed-loop model combining:
    • a locally-fit open-loop linear dynamics o_next ≈ A o + B a + c, and
    • the PPO policy’s deterministic action a = μ(o) (the neural network). This yields a differentiable closed-loop map g(o) = A o + B μ(o) + c suitable for neural-network bound propagation.
  • What is “unsafe”: a conservative proxy predicate on the next observation, because the full learned risk model (with non-linear feature extraction like top-k means/max) is not encoded end-to-end. The proxy used here is:
    • mean(top-k(hazards_lidar(o_next))) >= threshold.
  • What the certifier answers: using α,β-CROWN (via auto_LiRPA), compute sound bounds on g(o) for all o in the input box, then bound the proxy predicate:
    • SAFE: the certified upper bound of mean(top-k(hazards_lidar(o_next))) is below threshold (region is one-step safe under the model).
    • UNSAFE: the certified lower bound is above threshold (region is provably unsafe under the proxy).
    • UNKNOWN: bounds straddle the threshold (inconclusive).
  • Why this is useful: it can give a true “for all states in this region” statement (within the one-step linear dynamics approximation and proxy predicate), complementing the SMT-style “there exists a counterexample” search.

Why Stage 3 matters

  • It distinguishes “counterexample found”, “locally certified safe”, and “inconclusive”, and makes the assumptions explicit (local region + linearized dynamics + risk/proxy predicate), which is crucial for debugging and for targeted safety improvements.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages