Skip to content

MHS-20/ScalarPath

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

469 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build

ScalarPath

ScalarPath is an interactive pathfinding simulator that brings together deterministic planning, stochastic simulation, multi-agent coordination, and reinforcement learning — all visualised on a live tile grid.

Logo
---

Table of Contents

  1. Project Overview
  2. Terrain and Tile Types
  3. Single Agent Mode — Deterministic Planning
  4. Stochastic Pathfinding — SPN and CTMC
  5. Multi-Agent Mode — DAP
  6. Reinforcement Learning — Q-Learning
  7. Architecture Overview

1. Project Overview

ScalarPath is a desktop Swing application (Scala 3) that lets you place an agent on a procedurally generated tile grid and observe how different planning and learning strategies navigate it. The grid can be as simple as an open floor or as complex as a terrain map with obstacles, water, traps, and teleport tiles.

Four fundamentally different execution modes are available, selectable at runtime from the left sidebar:

Mode What runs Randomness
Single Deterministic A* or Prolog/BFS planner None
Swarm DAP N agents coordinated via a stochastic Petri net CTMC sampling
RL Agent Q-learning on a deterministic grid ε-greedy exploration
RL Stochastic Q-learning on a CTMC-backed stochastic grid CTMC + ε-greedy

It is also available an heatmap overlay showing the tiles-weighted distance to the goal. The heatmap is precomputed in deterministic mode, instead is updated in real-time during the stochastic or RL mode simulations.

Screen1
Screen2

2. Terrain and Tile Types

Every tile on the grid is one of the following:

Type Colour CTMC departure rate Dijkstra entry cost
Grass Green 10.0 0.10
Floor Light grey 5.0 0.20
Water Blue 1.0 1.00
Trap Orange 0.5 2.00
Obstacle Dark — (impassable)
Special Varied 5.0 0.20

The departure rate governs how quickly an agent leaves a tile during stochastic simulation. Fast terrain (Grass, high rate) is easy to traverse; slow terrain (Trap, low rate) holds the agent longer. The entry cost is the reciprocal of the rate and is used by Dijkstra when computing distances for the softmax bias.

Scenario Generation:

  • Terrain maps are generated with Perlin noise (Ken Perlin's smooth gradient noise, with fade function f(t) = 6t⁵ − 15t⁴ + 10t³), producing natural-looking landscapes with gradual transitions between tile types.
  • The Maze scenario uses branching recursive division to create a perfect maze (one unique path between any two points).
  • the Specials scenario adds programmable tiles (teleports, jumps) whose landing positions are defined by arbitrary Position => Position functions registered at construction time.

3. Single Agent Mode — Deterministic Planning

A single agent is placed at a start position and must reach the goal. The planner computes a complete path before execution; the agent then steps along it one direction at a time.

Three backend planners are available from the algorithm dropdown:

A* (AStarAlgorithm) — the canonical best-first search. The heuristic is squared Euclidean distance:

h(n) = (n.x − goal.x)² + (n.y − goal.y)²

The priority queue is ordered by f(n) = g(n) + h(n) where g(n) is the true cost from start. Special tiles (teleports) are handled by detecting large position deltas during path reconstruction and mapping them back to their source tile.

BFS / DFS (Prolog) — the Prolog-backed planners encode the grid as a set of move/2 facts and run breadth-first or depth-first search via tuProlog. The Scala side sends the serialised tile grid, directions, start and goal to the Prolog engine and deserialises the resulting plan.

Stochastic planner — wraps the same Prolog engine but uses the stochastic distance oracle for the heuristic, so the plan accounts for terrain cost rather than pure hop count.


4. Stochastic Pathfinding — SPN and CTMC

This is the formal core shared by both Swarm DAP mode and RL Stochastic mode.

4.1 Continuous-Time Markov Chain (CTMC)

A CTMC is a stochastic process that jumps between discrete states, spending a random amount of time in each one.

Formally, a CTMC over state space S is defined by a rate matrix Q where q(s, s') is the rate at which the process transitions from state s to state s'. The holding time in state s is exponentially distributed:

T_s ~ Exp(Λ_s)   where Λ_s = Σ_{s' ≠ s} q(s, s')

The probability of jumping to a particular next state is proportional to the rate of that transition:

P(next = s' | current = s) = q(s, s') / Λ_s

Gillespie algorithm

The simulator uses the Gillespie direct method (also called kinetic Monte Carlo) to generate exact sample paths:

  1. From state s, compute all outgoing transitions {(λ₁, s₁), …, (λₙ, sₙ)}.
  2. Sum the total rate: Λ = Σ λᵢ.
  3. Sample holding time: Δt = −ln(U) / Λ where U ~ Uniform(0,1).
  4. Choose next state sᵢ with probability λᵢ / Λ (weighted draw).
  5. Advance time by Δt, move to sᵢ, repeat.

Scala implementation

// CTMC.scala
trait CTMC[S]:
  def transitions(s: S): Set[Action[S]]   // Action = (rate, nextState)

// CTMCSimulation.scala — Gillespie step
LazyList.iterate(Event(0.0, s0)):
  case Event(t, s) =>
    val choices  = transitions(s).map(a => (a.rate, a.state)).toList
    val cumul    = Stochastics.cumulative(choices)   // prefix-sum of rates
    val totalR   = cumul.last._1
    val next     = Stochastics.draw(cumul)           // weighted random draw
    val dt       = -Math.log(rnd.nextDouble()) / totalR
    Event(t + dt, next)

The result is a LazyList[Event[S]] — an infinite, lazily evaluated stream of (time, state) pairs. Callers stop it with .takeWhile, .find, or a time bound.

4.2 Stochastic Petri Net (SPN)

A Stochastic Petri Net is a higher-level modelling language that compiles down to a CTMC. It consists of:

  • Places — abstract locations (here: grid positions)
  • Tokens — discrete units occupying places (here: the agent)
  • Transitions — firing rules that consume tokens from pre-condition places and produce tokens in effect places

Each transition has:

  • cond: MSet[P] — tokens consumed (pre-condition)
  • rate: MSet[P] => Double — firing rate as a function of the current marking
  • eff: MSet[P] — tokens produced (post-condition)
  • inh: MSet[P]inhibitor arc: the transition is disabled if any of these tokens are present

The marking is a multiset MSet[P] (a bag of place tokens). The SPN is lifted to a CTMC over markings by:

toCTMC(spn)(marking) =
  { Action(rate(marking), marking − cond ∪ eff)
    | Trn(cond, rate, eff, inh) ∈ spn,
      cond ⊆ marking,
      inh ∩ marking = ∅,
      rate(marking) > 0 }

Grid mapping

Tile grid               →  Petri net places
Agent at position p     →  one token at place p
Move p → q              →  Trn(cond={p}, rate=f, eff={q}, inh={})
Impassable tile         →  no transitions lead there

MSet in Scala

MSet[A] is an immutable multiset backed by Map[A, Int]. Key operations used by SPN firing:

  • extract(m) — removes m from this if possible, returning Some(remainder)
  • disjoined(m) — inhibitor check: true iff this ∩ m = ∅
  • union(m) — multiset union (additive)
// SPN.toCTMC firing rule
for
  Trn(cond, rate, eff, inh) <- spn
  if m.disjoined(inh)          // inhibitor check
  r = rate(m)
  if r > 0.0
  out <- m.extract(cond)       // pre-condition: consume tokens
yield Action(r, out.union(eff)) // produce new marking

4.3 Goal-oriented rate biasing (softmax)

A purely random walk would eventually reach the goal but inefficiently. To bias the agent toward the goal without removing the stochasticity, each transition rate is multiplied by a softmax improvement factor:

rate(from → to) = terrainRate(from) × exp(β × (dist(from, goal) − dist(to, goal)))

where dist(p, goal) is the terrain-weighted shortest distance (Dijkstra on the reversed graph), and β is the temperature parameter:

  • β = 0 → purely random walk, rates equal terrain base rates
  • β = 2 → recommended default, moderate goal bias
  • β = 5+ → near-deterministic greedy behaviour

Moves that decrease distance to the goal have a positive exponent → rate boosted. Moves that increase distance have a negative exponent → rate suppressed. The CTMC still fires randomly; β only shifts the probability mass.

Distance oracle

Distances are pre-computed once by running Dijkstra backwards from the goal on the reversed adjacency graph. This fills the entire Map[Position, Double] in one pass — O(|V| log |V|) — and is shared across all agents with the same goal.

Three oracle implementations: BFS (unweighted hop count), TerrainBFS (Dijkstra with terrain entry costs, the default), and AStar (also exposes a single-source query for targeted lookups).


5. Multi-Agent Mode — DAP

Theory

The Distributed Asynchronous stochastic Petri net (DAP) extends SPN to a network of nodes, each with its own local marking, communicating via broadcast messages.

A DAP rule reads:

pre --rateExp(localMarking)--> eff  |  ^msg

The global state is:

  • tokens: MSet[Token[ID, P]] — all tokens in the network, tagged by node identity
  • messages: MSet[Token[ID, P]] — pending broadcast messages
  • neighbours: Map[ID, Set[ID]] — network topology

Two types of CTMC actions emerge:

Rule firing at a node (stochastic): a node's local marking matches pre; it fires at rate rateExp(localMarking), replacing pre with eff and broadcasting msg to all neighbours.

Message delivery (instantaneous, rate = ∞): a pending message token is delivered to all neighbours of the sender and removed from the queue.

Multi-agent pathfinding via DAP

Each agent is a DAP node. The "network topology" is all-to-all (every agent is neighbour to every other agent, though the current movement rules use no gossip — msg = ∅). Each directed edge from → to in the grid becomes a rule for every agent, with the goal-oriented rate:

rate(id, from → to) = terrainRate(from) × exp(β × (dist_id(from) − dist_id(to)))

where dist_id is the distance map pre-computed for agent id's individual goal.

Collision avoidance is enforced by filtering out any CTMC transition that would result in two agents occupying the same tile:

val ctmc = s => raw.transitions(s).filterNot(a => hasCollision(a.state))

This is a hard constraint — colliding transitions are simply removed from the transition set, so the CTMC naturally routes around them.

Scala implementation

// DAP rule firing (one node)
for
  Rule(pre, rateExp, eff, msg) <- spn
  nodeId                       <- neighbours.keySet
  out                          <- tokens.extract(pre.map(Token(nodeId, _)))
  newTokens                     = out.union(eff.map(Token(nodeId, _)))
  rate                          = rateExp(localTokens(tokens, nodeId))
  if rate > 0.0
yield Action(rate, State(newTokens, newMessages, neighbours))

The controller (MultiAgentController) runs the CTMC trace in a background thread and pushes position/trail updates to the view via a step callback, which the Swing EDT uses to repaint the grid.


6. Reinforcement Learning — Q-Learning

6.1 Theory

Q-learning is a model-free, off-policy temporal difference (TD) algorithm. The agent learns a state-action value function Q(s, a) — the expected cumulative discounted reward for taking action a in state s and following the optimal policy thereafter.

Q-learning update rule (Bellman equation)

At each step the agent is in state s, takes action a, receives reward r, and lands in state s':

Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_{a'} Q(s', a'))
  • α ∈ (0, 1]learning rate: how strongly new information overwrites old
  • γ ∈ [0, 1]discount factor: how much future rewards are worth relative to immediate ones
  • r + γ · max Q(s', a')Bellman target: immediate reward plus discounted value of the best next action

After sufficient training, Q(s, a) converges to Q(s, a)*, the optimal action-value function, and the greedy policy π*(s) = argmax_a Q*(s, a) is optimal.

Exploration vs. exploitation — ε-greedy

During training, a pure greedy policy would fail to explore states it hasn't visited. The ε-greedy policy balances this:

π_ε(s) = { random action    with probability ε
          { argmax_a Q(s,a)  with probability 1 − ε

In this project ε = 0.2 by default. After training, playback uses the pure greedy policy (ε = 0).

Reward shaping

Outcome Reward
Reached the goal +100.0
Moved to a new passable tile −terrainCost(from)
Stayed in place (wall hit) −1.0 (stay penalty)

Terrain costs are the inverses of CTMC departure rates: Grass=0.1, Floor=0.2, Water=1.0, Trap=2.0. This means the agent learns not just to reach the goal but to prefer fast terrain.

6.2 RL Agent mode (deterministic)

The environment is a GridEnvironment.Deterministic: actions always succeed unless they hit an impassable tile, in which case the agent stays in place and pays the stay penalty.

override def apply(s: Position, a: Direction): (Reward, Position) =
  val candidate = s + a.vector
  val next      = if isPassable(candidate) then candidate else s
  (reward(s, next), next)

This is the standard grid-world RL setup. The learned policy can then be compared directly against A*.

6.3 RL Stochastic mode

The environment is a GridEnvironment.Stochastic. Each action does not deterministically move the agent; instead it instantiates a one-step CTMC run biased toward the chosen direction:

val spf = StochasticPathfinding(
  tiles, directions, s, goal,
  oracle = DistanceOracle.TerrainBFS,
  beta   = beta + actionBias,   // chosen direction extra-boosted
)
val trace  = spf.simulationTrace(rnd)
val nextPos = trace.drop(1).headOption
               .map(e => spf.positionOf(e.state))
               .getOrElse(s)

The actionBias raises the softmax temperature in the chosen direction, making it more likely the agent moves that way — but not certain. Other directions can still fire. The agent therefore trains under genuine stochastic dynamics, which is more realistic for physical or noisy environments.

The Q-table that results captures expected cumulative reward under uncertainty, not just optimal paths in a clean world.

6.4 Policy extraction and playback

After training, extractPath follows the greedy policy from start to goal:

def bestAction(pos: Position, forbidden: Set[Position]): Option[Direction] =
  val candidates = directions
    .map(d => d -> (pos + d.vector))
    .filter((_, next) => passable.contains(next))   // only passable moves
    .sortBy((d, next) => -q(pos, d))                // descending Q-value
  candidates.find((_, next) => !forbidden.contains(next))
    .orElse(candidates.headOption)
    .map(_._1)

Importantly, the forbidden set of already-visited positions is passed at each step so that the path-extractor can break cycles by trying the second-best passable action rather than looping indefinitely. This is essential in RL Stochastic mode where the Q-table may have convergence imperfections.

6.5 Scala implementation — QRL abstraction

The Q-learning engine is fully generic, parameterised on State and Action types:

trait QRL:
  type State; type Action; type Reward = Double

  trait Q extends ((State, Action) => Reward):
    def bestPolicy: Policy = s => actions.maxBy(this(s, _))
    def update(s: State, a: Action, v: Reward): Q
    def vFunction: State => Reward = s => actions.map(this(s, _)).max

The concrete QFunction is a mutable Map[(State, Action), Reward] with a default value v0 for unseen pairs. The QLearning case class holds the hyperparameters and implements the Bellman update:

def updateQ(s: State, qf: Q): (State, Q) =
  val a        = qf.epsPolicy(epsilon)(s)
  val (r, s2)  = system.environment(s, a)
  val bellman  = (1 - alpha) * qf(s, a) + alpha * (r + gamma * qf.vFunction(s2))
  (s2, qf.update(s, a, bellman))

Training runs in a background daemon thread (RLController), firing an EDT callback after each episode batch to update the progress bar without blocking the UI.


State machine

Simulation is a state machine expressed as StateT[IO, State, *] (Cats Effect). The main loop polls the current State every tick and dispatches on (previousState, currentState) transitions. The transition objects (Resume, Pause, Reset, ConfigurationChanged, …) are Scala 3 unapply extractors, keeping the match expression in handleTransition readable:

case Resume() =>
  if isRLMode then
    _rlController match
      case Some(rc) if rc.trainingComplete => rc.startPlayback()
      case _                               => startRL()
  else if _currentMode == Swarm then startSwarm()
  else View.resume()

Thread model

  • Main loop — a dedicated Thread polling Simulation.current at ~60 Hz, sleeping via SpeedManager.delay().
  • RL trainingRL-Training-Thread (daemon); fires EDT callbacks via Swing.onEDT after each batch.
  • RL playbackRL-Playback-Thread (daemon); advances _playbackIndex and fires view callbacks at playbackDelay ms per step.
  • Swarm simulationMultiAgentController runs the CTMC LazyList on a background thread and batches EDT updates.
  • Swing EDT — all view mutations go through Swing.onEDT; the grid's paintComponent reads only @volatile fields.

About

Stochastic, Swarm and RL Pathfinding Simulation

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors