This project implements a reinforcement learning (RL) navigation system where an agent learns to reach a goal in randomly generated maze environments with moving obstacles. The system starts with a tabular Q-learning baseline and is later upgraded to a Deep Q-Network (DQN) with stabilization techniques such as experience replay and a target network.
The objective is to study how an RL agent can learn goal-directed navigation in stochastic and dynamic environments, which is a core problem in robotics and autonomous systems. To stabilize learning and improve navigation performance, the environment was recently tuned: obstacle density is reduced, obstacles move more slowly, and the exploration schedule is extended.
- Random maze generation every episode
- Reduced environment difficulty with sparser obstacles (`size // 2`)
- Slower dynamic obstacles that move every 2 agent steps to stabilize learning
- Extended exploration schedule (`epsilon_decay=0.999`, `epsilon_min=0.02`)
- Local obstacle sensors combined with a goal-relative state representation
- Reward shaping for efficient learning
- Tabular Q-learning baseline
- Deep Q-Network (DQN) implementation variations
- Experience replay buffer and target network stabilization (`dqn_agent_v2.py`)
- Evaluation and visualization scripts for multiple models
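To give a feel for how long the exploration phase lasts with these settings, here is a minimal sketch. It assumes epsilon starts at 1.0 and is multiplied by `epsilon_decay` once per episode; the start value and the decay timing are assumptions, and the helper names are hypothetical.

```python
def epsilon_schedule(episodes, epsilon_start=1.0, epsilon_decay=0.999, epsilon_min=0.02):
    """Yield the exploration rate used in each episode.

    epsilon_start and the once-per-episode decay are assumptions;
    epsilon_decay and epsilon_min are the values listed in this README.
    """
    epsilon = epsilon_start
    for _ in range(episodes):
        yield epsilon
        epsilon = max(epsilon_min, epsilon * epsilon_decay)


def episodes_until_floor(epsilon_start=1.0, epsilon_decay=0.999, epsilon_min=0.02):
    """Count decay steps until epsilon is clamped at epsilon_min."""
    epsilon, steps = epsilon_start, 0
    while epsilon > epsilon_min:
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        steps += 1
    return steps


print(list(epsilon_schedule(3)))
print(episodes_until_floor())
```

Under these assumptions the agent keeps exploring above the 2% floor for roughly 3,900 episodes, which is why the schedule is described as extended.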
The agent must learn to navigate a grid-based maze to reach a goal while avoiding obstacles.
Challenges in the environment include:
- Random maze layouts
- Moving obstacles
- Partial observation
- Stochastic dynamics
These conditions make the task significantly more difficult than classical gridworld environments.
The environment is a 2D grid maze.
Example layout:
A . . X .
. . X . .
. . . . .
. X . . .
. . . . G
Legend:
A = Agent
G = Goal
X = Obstacle
. = Free cell
Each episode generates a new maze configuration with random obstacle placement.
state = env.reset(seed=episode)
This forces the agent to learn general navigation strategies instead of memorizing paths.
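As an illustration of what a seeded reset could do internally, here is a minimal sketch. `generate_layout` is a hypothetical helper, not the code in `maze_env.py`. It assumes the agent starts top-left and the goal sits bottom-right (as in the example layout), and it uses the `size // 2` obstacle count described in this README.

```python
import random


def generate_layout(size, seed):
    """Return (agent, goal, obstacles) for one episode.

    Hypothetical helper: the corner positions of agent and goal are an
    assumption taken from the example layout.
    """
    rng = random.Random(seed)  # local RNG, so the same seed gives the same maze
    agent, goal = (0, 0), (size - 1, size - 1)
    free_cells = [
        (x, y)
        for y in range(size)
        for x in range(size)
        if (x, y) not in (agent, goal)
    ]
    obstacles = set(rng.sample(free_cells, size // 2))
    return agent, goal, obstacles


for episode in range(3):
    agent, goal, obstacles = generate_layout(size=5, seed=episode)
    print(episode, sorted(obstacles))
```

Seeding with the episode index keeps every run reproducible while still presenting a different maze each episode.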
Obstacles move randomly during the episode. To make the learning curve manageable and realistic:
- The number of obstacles scales with the grid size: `self.size // 2`.
- Obstacles move once every 2 agent steps (instead of every step), which gives the agent more time to react and makes the environment less chaotic.
Agent moves → (Every 2 steps: Obstacles move) → Collision check
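The step order above can be sketched as follows. The names `move_obstacles`, `step_dynamics` and `obstacle_period` are hypothetical. The rules that obstacles stay in bounds and never enter the goal or each other's cells are assumptions, not confirmed behavior of `maze_env.py`.

```python
import random

MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right


def move_obstacles(obstacles, size, goal, rng):
    """Move each obstacle one cell in a random direction if the target is free."""
    moved = set()
    for (x, y) in sorted(obstacles):
        dx, dy = rng.choice(MOVES)
        target = (x + dx, y + dy)
        in_bounds = 0 <= target[0] < size and 0 <= target[1] < size
        blocked = target in moved or target in obstacles or target == goal
        # Stay in place when the chosen cell is off-grid or occupied.
        moved.add(target if in_bounds and not blocked else (x, y))
    return moved


def step_dynamics(agent, obstacles, step_count, size, goal, rng, obstacle_period=2):
    """Apply 'agent moved -> obstacles maybe move -> collision check'.

    `agent` is the agent position after its own move.
    """
    step_count += 1
    if step_count % obstacle_period == 0:
        obstacles = move_obstacles(obstacles, size, goal, rng)
    collided = agent in obstacles
    return obstacles, step_count, collided
```

Because the collision check runs after the obstacles move, an obstacle stepping onto the agent also counts as a collision in this sketch.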
The agent observes a compact state representation consisting of:
(dx, dy, obstacle_up, obstacle_down, obstacle_left, obstacle_right)
Where:
dx = goal_x - agent_x
dy = goal_y - agent_y
Obstacle sensors:
obstacle_up
obstacle_down
obstacle_left
obstacle_right
Example state:
(3, -1, 0, 1, 0, 0)
Meaning:
- Goal is 3 cells right, 1 cell up
- Obstacle detected below
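A minimal sketch of how such a state tuple could be computed. `encode_state` is a hypothetical helper. It assumes positions are `(x, y)` with `y` growing downward, and that the sensors report obstacles only, not walls.

```python
def encode_state(agent, goal, obstacles):
    """Build (dx, dy, obstacle_up, obstacle_down, obstacle_left, obstacle_right).

    Positions are (x, y) with y growing downward, which matches the example
    where dy = -1 means the goal is one cell up. Walls are not reported by
    the sensors in this sketch (an assumption).
    """
    ax, ay = agent
    gx, gy = goal
    return (
        gx - ax,
        gy - ay,
        int((ax, ay - 1) in obstacles),  # up
        int((ax, ay + 1) in obstacles),  # down
        int((ax - 1, ay) in obstacles),  # left
        int((ax + 1, ay) in obstacles),  # right
    )


state = encode_state(agent=(1, 2), goal=(4, 1), obstacles={(1, 3)})
print(state)  # (3, -1, 0, 1, 0, 0)
```

The state does not contain absolute coordinates or the maze layout, so the same tuple can occur in many different mazes. This is what lets a learned policy transfer to unseen layouts.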
The agent can perform four discrete actions:
0 → move up
1 → move down
2 → move left
3 → move right
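The action indices can be mapped to grid offsets as in the sketch below. `ACTION_DELTAS` and `apply_action` are hypothetical names. Clamping moves at the grid border is an assumption; the actual environment may treat wall hits differently.

```python
ACTION_DELTAS = {
    0: (0, -1),  # up
    1: (0, 1),   # down
    2: (-1, 0),  # left
    3: (1, 0),   # right
}


def apply_action(agent, action, size):
    """Return the agent position after an action, staying inside the grid."""
    dx, dy = ACTION_DELTAS[action]
    x = min(max(agent[0] + dx, 0), size - 1)
    y = min(max(agent[1] + dy, 0), size - 1)
    return (x, y)


print(apply_action((0, 0), 3, size=5))  # (1, 0)
print(apply_action((0, 0), 0, size=5))  # (0, 0), blocked by the top border
```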
The reward strategy combines sparse rewards with reward shaping.
| Event | Reward |
|---|---|
| Move closer to goal | +1 |
| Move away from goal | -1 |
| Step penalty | -1 |
| Collision with obstacle | -100 |
| Reach goal | +100 |
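The table can be turned into a reward function along these lines. How the rows combine is an assumption: here terminal events return their value alone, and otherwise the step penalty is added to the shaping term. Distance is measured as Manhattan distance, also an assumption.

```python
def manhattan(a, b):
    """Grid distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def compute_reward(old_pos, new_pos, goal, collided):
    """Return (reward, done) using the values from the reward table.

    Assumption: collisions and goal arrival end the episode and return
    their value alone; other steps get shaping plus the step penalty.
    """
    if collided:
        return -100, True
    if new_pos == goal:
        return 100, True
    old_dist, new_dist = manhattan(old_pos, goal), manhattan(new_pos, goal)
    shaping = 1 if new_dist < old_dist else -1 if new_dist > old_dist else 0
    return shaping - 1, False


print(compute_reward((0, 0), (1, 0), goal=(4, 4), collided=False))  # (0, False)
print(compute_reward((1, 0), (0, 0), goal=(4, 4), collided=False))  # (-2, False)
```

With this combination, a step toward the goal is free and a step away costs 2. Wandering is therefore always penalized, while the large terminal rewards still dominate.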
Baseline algorithm using a dictionary-based Q-table.
Update rule:
Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
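A minimal sketch of this update with a dictionary-based Q-table. The function names and the values of `alpha` and `gamma` are assumptions, as is skipping the bootstrap term on terminal transitions (a common convention); `q_learning.py` may differ in all of these.

```python
from collections import defaultdict

N_ACTIONS = 4


def make_q_table():
    """Dictionary-based Q-table: unseen states start with all-zero action values."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(q_table, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q(s, a)."""
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


q = make_q_table()
s, s_next = (3, -1, 0, 1, 0, 0), (2, -1, 0, 0, 0, 0)
print(q_update(q, s, action=3, reward=0, next_state=s_next, done=False))  # 0.0
print(q_update(q, s_next, action=3, reward=100, next_state=s_next, done=True))  # 10.0
```

The dictionary only grows for states the agent actually visits, which keeps the table small even though `dx` and `dy` can take many values.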
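The vanilla DQN replaces the Q-table with a neural network. The state vector is fed directly into an MLP with two hidden layers of 64 units each, which outputs one Q-value per action. TD targets are computed with the same online network, with no separate target network.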
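To make the network shape concrete, here is a dependency-free sketch of the forward pass. It assumes one reading of the description above: two hidden layers of 64 units each, 6 state features in and 4 Q-values out. The ReLU activation, the initialization and all names are assumptions. The real agent in `dqn_agent.py` presumably uses a deep learning framework, and its API is not reproduced here.

```python
import random


def init_mlp(sizes, seed=0):
    """Create [(weights, biases), ...] for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = (1.0 / n_in) ** 0.5
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def q_values(layers, state):
    """Forward pass: ReLU on hidden layers, linear output (one Q-value per action)."""
    x = [float(v) for v in state]
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


network = init_mlp([6, 64, 64, 4])  # 6 state features -> 64 -> 64 -> 4 actions
qs = q_values(network, (3, -1, 0, 1, 0, 0))
greedy_action = max(range(len(qs)), key=qs.__getitem__)
print(qs, greedy_action)
```

Unlike the Q-table, the network shares parameters across states, so it can produce estimates for state tuples it has never seen.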
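To address the instability of the vanilla DQN, the advanced agent (`dqn_agent_v2.py`) adds:
- Experience replay buffer: stores past transitions and samples them at random for training, which breaks the temporal correlation between consecutive updates and lets each transition be reused.
- Target network: a periodically synchronized copy of the online network that computes the TD targets, so the targets do not shift with every gradient step.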
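Both techniques can be sketched without a deep learning framework. In the code below, `target_q` stands for any callable that returns the target network's Q-values for a state. The class and function names, the buffer capacity, `gamma` and `sync_every` are all assumptions rather than values taken from `dqn_agent_v2.py`.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the ordering of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    return [
        reward if done else reward + gamma * max(target_q(next_state))
        for _, _, reward, next_state, done in batch
    ]


def should_sync(step, sync_every=500):
    """True on the steps where the target network copies the online weights."""
    return step % sync_every == 0
```

A training loop would push each transition, sample a batch once the buffer holds enough data, regress the online network toward `td_targets`, and copy the online weights into the target network whenever `should_sync` returns true.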
rl_dynamic_maze/
    env/
        maze_env.py              # Core environment simulation
    agents/
        q_learning.py            # Tabular Q-Learning baseline
        dqn_agent.py             # Vanilla DQN agent
        dqn_agent_v2.py          # Advanced DQN with Experience Replay & Target Network
    # Training Scripts
    train.py                     # Trains Q-Learning Agent
    train_dqn.py                 # Trains DQN Agent
    train_dqn.ipynb              # Jupyter Notebook for interactive DQN training
    # Evaluation & Testing Scripts
    evaluate.py                  # Evaluates Q-Learning on generalized scenarios
    evaluate_dqn.py              # Evaluates DQN on generalized scenarios
    test_trained_agent.py        # Watch the trained Q-Learning agent navigate (render)
    test.py                      # Environment interaction mechanics test script
    # Artifacts
    q_table.pkl                  # Q-table weights (generated after train.py)
    dqn_weights.pth              # Neural network weights (generated after train_dqn.py)
    README.md
Option A: Train the Q-Learning agent
python train.py
Option B: Train the DQN agent
python train_dqn.py
Evaluate the agents on unseen random maze environments (testing for generalization):
python evaluate.py       # For Q-Learning
python evaluate_dqn.py   # For DQN
Render the environment and watch the trained Q-Learning agent play:
python test_trained_agent.py
Print the grid while stepping through fixed manual actions:
python test.py
We track performance using:
- Success rate
- Collision rate
- Average episode reward
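These three metrics can be computed from per-episode records as in the sketch below. The record format (`outcome` and `reward` keys) and the `summarize` helper are hypothetical, and the sample values are made up purely to show the arithmetic. They are not results from this project.

```python
def summarize(episodes):
    """Aggregate per-episode results into the three tracked metrics.

    Each episode is a dict with 'outcome' ('goal', 'collision' or 'timeout')
    and 'reward'; this record format is hypothetical.
    """
    total = len(episodes)
    if total == 0:
        raise ValueError("need at least one episode")
    return {
        "success_rate": sum(e["outcome"] == "goal" for e in episodes) / total,
        "collision_rate": sum(e["outcome"] == "collision" for e in episodes) / total,
        "avg_reward": sum(e["reward"] for e in episodes) / total,
    }


# Made-up records for illustration only.
results = [
    {"outcome": "goal", "reward": 92},
    {"outcome": "collision", "reward": -104},
    {"outcome": "goal", "reward": 95},
    {"outcome": "timeout", "reward": -50},
]
print(summarize(results))
# {'success_rate': 0.5, 'collision_rate': 0.25, 'avg_reward': 8.25}
```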
Recent changes to the movement penalties and the obstacle speed have made the training curve noticeably more stable and raised success rates for both the Q-learning and the DQN agents.
- Double DQN implementations
- Prioritized experience replay
- Dueling Deep Q-Networks
- Multi-agent obstacle environments
- Vision-based state representation