ScalarPath is an interactive pathfinding simulator that brings together deterministic planning, stochastic simulation, multi-agent coordination, and reinforcement learning — all visualised on a live tile grid.
---- Project Overview
- Terrain and Tile Types
- Single Agent Mode — Deterministic Planning
- Stochastic Pathfinding — SPN and CTMC
- Multi-Agent Mode — DAP
- Reinforcement Learning — Q-Learning
- Architecture Overview
ScalarPath is a desktop Swing application (Scala 3) that lets you place an agent on a procedurally generated tile grid and observe how different planning and learning strategies navigate it. The grid can be as simple as an open floor or as complex as a terrain map with obstacles, water, traps, and teleport tiles.
Four fundamentally different execution modes are available, selectable at runtime from the left sidebar:
| Mode | What runs | Randomness |
|---|---|---|
| Single | Deterministic A* or Prolog/BFS planner | None |
| Swarm DAP | N agents coordinated via a stochastic Petri net | CTMC sampling |
| RL Agent | Q-learning on a deterministic grid | ε-greedy exploration |
| RL Stochastic | Q-learning on a CTMC-backed stochastic grid | CTMC + ε-greedy |
It is also available an heatmap overlay showing the tiles-weighted distance to the goal. The heatmap is precomputed in deterministic mode, instead is updated in real-time during the stochastic or RL mode simulations.
Every tile on the grid is one of the following:
| Type | Colour | CTMC departure rate | Dijkstra entry cost |
|---|---|---|---|
| Grass | Green | 10.0 | 0.10 |
| Floor | Light grey | 5.0 | 0.20 |
| Water | Blue | 1.0 | 1.00 |
| Trap | Orange | 0.5 | 2.00 |
| Obstacle | Dark | — (impassable) | ∞ |
| Special | Varied | 5.0 | 0.20 |
The departure rate governs how quickly an agent leaves a tile during stochastic simulation. Fast terrain (Grass, high rate) is easy to traverse; slow terrain (Trap, low rate) holds the agent longer. The entry cost is the reciprocal of the rate and is used by Dijkstra when computing distances for the softmax bias.
Scenario Generation:
- Terrain maps are generated with Perlin noise (Ken Perlin's smooth gradient noise, with fade function
f(t) = 6t⁵ − 15t⁴ + 10t³), producing natural-looking landscapes with gradual transitions between tile types. - The Maze scenario uses branching recursive division to create a perfect maze (one unique path between any two points).
- the Specials scenario adds programmable tiles (teleports, jumps) whose landing positions are defined by arbitrary
Position => Positionfunctions registered at construction time.
A single agent is placed at a start position and must reach the goal. The planner computes a complete path before execution; the agent then steps along it one direction at a time.
Three backend planners are available from the algorithm dropdown:
A* (AStarAlgorithm) — the canonical best-first search. The heuristic is squared Euclidean distance:
h(n) = (n.x − goal.x)² + (n.y − goal.y)²
The priority queue is ordered by f(n) = g(n) + h(n) where g(n) is the true cost from start.
Special tiles (teleports) are handled by detecting large position deltas during path reconstruction and mapping them back to their source tile.
BFS / DFS (Prolog) — the Prolog-backed planners encode the grid as a set of move/2 facts and run breadth-first or depth-first search via tuProlog.
The Scala side sends the serialised tile grid, directions, start and goal to the Prolog engine and deserialises the resulting plan.
Stochastic planner — wraps the same Prolog engine but uses the stochastic distance oracle for the heuristic, so the plan accounts for terrain cost rather than pure hop count.
This is the formal core shared by both Swarm DAP mode and RL Stochastic mode.
A CTMC is a stochastic process that jumps between discrete states, spending a random amount of time in each one.
Formally, a CTMC over state space S is defined by a rate matrix Q where q(s, s') is the rate at which the process transitions from state s to state s'. The holding time in state s is exponentially distributed:
T_s ~ Exp(Λ_s) where Λ_s = Σ_{s' ≠ s} q(s, s')
The probability of jumping to a particular next state is proportional to the rate of that transition:
P(next = s' | current = s) = q(s, s') / Λ_s
The simulator uses the Gillespie direct method (also called kinetic Monte Carlo) to generate exact sample paths:
- From state s, compute all outgoing transitions
{(λ₁, s₁), …, (λₙ, sₙ)}. - Sum the total rate:
Λ = Σ λᵢ. - Sample holding time:
Δt = −ln(U) / ΛwhereU ~ Uniform(0,1). - Choose next state sᵢ with probability
λᵢ / Λ(weighted draw). - Advance time by
Δt, move tosᵢ, repeat.
// CTMC.scala
trait CTMC[S]:
def transitions(s: S): Set[Action[S]] // Action = (rate, nextState)
// CTMCSimulation.scala — Gillespie step
LazyList.iterate(Event(0.0, s0)):
case Event(t, s) =>
val choices = transitions(s).map(a => (a.rate, a.state)).toList
val cumul = Stochastics.cumulative(choices) // prefix-sum of rates
val totalR = cumul.last._1
val next = Stochastics.draw(cumul) // weighted random draw
val dt = -Math.log(rnd.nextDouble()) / totalR
Event(t + dt, next)The result is a LazyList[Event[S]] — an infinite, lazily evaluated stream of (time, state) pairs.
Callers stop it with .takeWhile, .find, or a time bound.
A Stochastic Petri Net is a higher-level modelling language that compiles down to a CTMC. It consists of:
- Places — abstract locations (here: grid positions)
- Tokens — discrete units occupying places (here: the agent)
- Transitions — firing rules that consume tokens from pre-condition places and produce tokens in effect places
Each transition has:
cond: MSet[P]— tokens consumed (pre-condition)rate: MSet[P] => Double— firing rate as a function of the current markingeff: MSet[P]— tokens produced (post-condition)inh: MSet[P]— inhibitor arc: the transition is disabled if any of these tokens are present
The marking is a multiset MSet[P] (a bag of place tokens). The SPN is lifted to a CTMC over markings by:
toCTMC(spn)(marking) =
{ Action(rate(marking), marking − cond ∪ eff)
| Trn(cond, rate, eff, inh) ∈ spn,
cond ⊆ marking,
inh ∩ marking = ∅,
rate(marking) > 0 }
Tile grid → Petri net places
Agent at position p → one token at place p
Move p → q → Trn(cond={p}, rate=f, eff={q}, inh={})
Impassable tile → no transitions lead there
MSet[A] is an immutable multiset backed by Map[A, Int]. Key operations used by SPN firing:
extract(m)— removesmfromthisif possible, returningSome(remainder)disjoined(m)— inhibitor check: true iffthis ∩ m = ∅union(m)— multiset union (additive)
// SPN.toCTMC firing rule
for
Trn(cond, rate, eff, inh) <- spn
if m.disjoined(inh) // inhibitor check
r = rate(m)
if r > 0.0
out <- m.extract(cond) // pre-condition: consume tokens
yield Action(r, out.union(eff)) // produce new markingA purely random walk would eventually reach the goal but inefficiently. To bias the agent toward the goal without removing the stochasticity, each transition rate is multiplied by a softmax improvement factor:
rate(from → to) = terrainRate(from) × exp(β × (dist(from, goal) − dist(to, goal)))
where dist(p, goal) is the terrain-weighted shortest distance (Dijkstra on the reversed graph), and β is the temperature parameter:
β = 0→ purely random walk, rates equal terrain base ratesβ = 2→ recommended default, moderate goal biasβ = 5+→ near-deterministic greedy behaviour
Moves that decrease distance to the goal have a positive exponent → rate boosted. Moves that increase distance have a negative exponent → rate suppressed. The CTMC still fires randomly; β only shifts the probability mass.
Distances are pre-computed once by running Dijkstra backwards from the goal on the reversed adjacency graph.
This fills the entire Map[Position, Double] in one pass — O(|V| log |V|) — and is shared across all agents with the same goal.
Three oracle implementations: BFS (unweighted hop count), TerrainBFS (Dijkstra with terrain entry costs, the default), and AStar (also exposes a single-source query for targeted lookups).
The Distributed Asynchronous stochastic Petri net (DAP) extends SPN to a network of nodes, each with its own local marking, communicating via broadcast messages.
A DAP rule reads:
pre --rateExp(localMarking)--> eff | ^msg
The global state is:
tokens: MSet[Token[ID, P]]— all tokens in the network, tagged by node identitymessages: MSet[Token[ID, P]]— pending broadcast messagesneighbours: Map[ID, Set[ID]]— network topology
Two types of CTMC actions emerge:
Rule firing at a node (stochastic): a node's local marking matches pre; it fires at rate rateExp(localMarking), replacing pre with eff and broadcasting msg to all neighbours.
Message delivery (instantaneous, rate = ∞): a pending message token is delivered to all neighbours of the sender and removed from the queue.
Each agent is a DAP node. The "network topology" is all-to-all (every agent is neighbour to every other agent, though the current movement rules use no gossip — msg = ∅). Each directed edge from → to in the grid becomes a rule for every agent, with the goal-oriented rate:
rate(id, from → to) = terrainRate(from) × exp(β × (dist_id(from) − dist_id(to)))
where dist_id is the distance map pre-computed for agent id's individual goal.
Collision avoidance is enforced by filtering out any CTMC transition that would result in two agents occupying the same tile:
val ctmc = s => raw.transitions(s).filterNot(a => hasCollision(a.state))This is a hard constraint — colliding transitions are simply removed from the transition set, so the CTMC naturally routes around them.
// DAP rule firing (one node)
for
Rule(pre, rateExp, eff, msg) <- spn
nodeId <- neighbours.keySet
out <- tokens.extract(pre.map(Token(nodeId, _)))
newTokens = out.union(eff.map(Token(nodeId, _)))
rate = rateExp(localTokens(tokens, nodeId))
if rate > 0.0
yield Action(rate, State(newTokens, newMessages, neighbours))The controller (MultiAgentController) runs the CTMC trace in a background thread and pushes position/trail updates to the view via a step callback, which the Swing EDT uses to repaint the grid.
Q-learning is a model-free, off-policy temporal difference (TD) algorithm. The agent learns a state-action value function Q(s, a) — the expected cumulative discounted reward for taking action a in state s and following the optimal policy thereafter.
At each step the agent is in state s, takes action a, receives reward r, and lands in state s':
Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_{a'} Q(s', a'))
α ∈ (0, 1]— learning rate: how strongly new information overwrites oldγ ∈ [0, 1]— discount factor: how much future rewards are worth relative to immediate onesr + γ · max Q(s', a')— Bellman target: immediate reward plus discounted value of the best next action
After sufficient training, Q(s, a) converges to Q(s, a)*, the optimal action-value function, and the greedy policy π*(s) = argmax_a Q*(s, a) is optimal.
During training, a pure greedy policy would fail to explore states it hasn't visited. The ε-greedy policy balances this:
π_ε(s) = { random action with probability ε
{ argmax_a Q(s,a) with probability 1 − ε
In this project ε = 0.2 by default. After training, playback uses the pure greedy policy (ε = 0).
| Outcome | Reward |
|---|---|
| Reached the goal | +100.0 |
| Moved to a new passable tile | −terrainCost(from) |
| Stayed in place (wall hit) | −1.0 (stay penalty) |
Terrain costs are the inverses of CTMC departure rates: Grass=0.1, Floor=0.2, Water=1.0, Trap=2.0. This means the agent learns not just to reach the goal but to prefer fast terrain.
The environment is a GridEnvironment.Deterministic: actions always succeed unless they hit an impassable tile, in which case the agent stays in place and pays the stay penalty.
override def apply(s: Position, a: Direction): (Reward, Position) =
val candidate = s + a.vector
val next = if isPassable(candidate) then candidate else s
(reward(s, next), next)This is the standard grid-world RL setup. The learned policy can then be compared directly against A*.
The environment is a GridEnvironment.Stochastic. Each action does not deterministically move the agent; instead it instantiates a one-step CTMC run biased toward the chosen direction:
val spf = StochasticPathfinding(
tiles, directions, s, goal,
oracle = DistanceOracle.TerrainBFS,
beta = beta + actionBias, // chosen direction extra-boosted
)
val trace = spf.simulationTrace(rnd)
val nextPos = trace.drop(1).headOption
.map(e => spf.positionOf(e.state))
.getOrElse(s)The actionBias raises the softmax temperature in the chosen direction, making it more likely the agent moves that way — but not certain. Other directions can still fire. The agent therefore trains under genuine stochastic dynamics, which is more realistic for physical or noisy environments.
The Q-table that results captures expected cumulative reward under uncertainty, not just optimal paths in a clean world.
After training, extractPath follows the greedy policy from start to goal:
def bestAction(pos: Position, forbidden: Set[Position]): Option[Direction] =
val candidates = directions
.map(d => d -> (pos + d.vector))
.filter((_, next) => passable.contains(next)) // only passable moves
.sortBy((d, next) => -q(pos, d)) // descending Q-value
candidates.find((_, next) => !forbidden.contains(next))
.orElse(candidates.headOption)
.map(_._1)Importantly, the forbidden set of already-visited positions is passed at each step so that the path-extractor can break cycles by trying the second-best passable action rather than looping indefinitely. This is essential in RL Stochastic mode where the Q-table may have convergence imperfections.
The Q-learning engine is fully generic, parameterised on State and Action types:
trait QRL:
type State; type Action; type Reward = Double
trait Q extends ((State, Action) => Reward):
def bestPolicy: Policy = s => actions.maxBy(this(s, _))
def update(s: State, a: Action, v: Reward): Q
def vFunction: State => Reward = s => actions.map(this(s, _)).maxThe concrete QFunction is a mutable Map[(State, Action), Reward] with a default value v0 for unseen pairs. The QLearning case class holds the hyperparameters and implements the Bellman update:
def updateQ(s: State, qf: Q): (State, Q) =
val a = qf.epsPolicy(epsilon)(s)
val (r, s2) = system.environment(s, a)
val bellman = (1 - alpha) * qf(s, a) + alpha * (r + gamma * qf.vFunction(s2))
(s2, qf.update(s, a, bellman))Training runs in a background daemon thread (RLController), firing an EDT callback after each episode batch to update the progress bar without blocking the UI.
Simulation is a state machine expressed as StateT[IO, State, *] (Cats Effect).
The main loop polls the current State every tick and dispatches on (previousState, currentState) transitions.
The transition objects (Resume, Pause, Reset, ConfigurationChanged, …) are Scala 3 unapply extractors, keeping the match expression in handleTransition readable:
case Resume() =>
if isRLMode then
_rlController match
case Some(rc) if rc.trainingComplete => rc.startPlayback()
case _ => startRL()
else if _currentMode == Swarm then startSwarm()
else View.resume()- Main loop — a dedicated
ThreadpollingSimulation.currentat ~60 Hz, sleeping viaSpeedManager.delay(). - RL training —
RL-Training-Thread(daemon); fires EDT callbacks viaSwing.onEDTafter each batch. - RL playback —
RL-Playback-Thread(daemon); advances_playbackIndexand fires view callbacks atplaybackDelayms per step. - Swarm simulation —
MultiAgentControllerruns the CTMCLazyListon a background thread and batches EDT updates. - Swing EDT — all view mutations go through
Swing.onEDT; the grid'spaintComponentreads only@volatilefields.


