This repository collects the projects developed during the Reinforcement Learning (RL) course at the National Polytechnic University of Armenia (NPUA). The projects pair theory with practical implementation to build a working understanding of decision-making systems. They follow the examples and algorithms in Sutton and Barto's Reinforcement Learning: An Introduction (2nd ed.).
Reinforcement Learning (RL) studies how agents learn to make decisions by interacting with an environment. The agent chooses actions, observes the outcomes, and adapts its strategy to maximize cumulative reward. Unlike traditional supervised learning, RL relies on trial and error rather than labeled data. It is widely applied in robotics, games, and autonomous systems where sequential decision-making is crucial.
Description: Demonstrates how an agent learns to play optimally via self-play and tabular value updates, showcasing core RL concepts like value estimation, policy improvement, and strategic convergence.
Description: Simulates a multi-armed bandit problem to compare different action-selection strategies such as ε-greedy, UCB, and gradient methods. The project visualizes average rewards and optimal action rates over time to build intuition around exploration-exploitation dynamics.
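As a rough illustration of the ε-greedy strategy mentioned above, here is a minimal sketch using incremental sample-average estimates. The function name `run_epsilon_greedy`, the Gaussian reward noise, and the arm means are assumptions made for this example, not the repository's code.

```python
import random

def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Run one epsilon-greedy agent with sample-average value estimates."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k   # estimated value of each arm
    n = [0] * k     # number of pulls of each arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore: uniform random arm
        else:
            a = max(range(k), key=lambda i: q[i])   # exploit: first arm with the highest estimate
        r = rng.gauss(true_means[a], 1.0)           # noisy reward (unit variance is an assumption)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                   # incremental sample average
        total += r
    return q, n, total / steps

# Hypothetical 4-armed testbed; arm 2 has the highest true mean.
q, n, avg_reward = run_epsilon_greedy([0.2, 0.8, 1.5, -0.3], epsilon=0.1, steps=5000)
print("estimates:", [round(x, 2) for x in q])
print("pull counts:", n)
```

Setting `epsilon=0` gives a purely greedy agent, which is a useful baseline when comparing exploration strategies.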
Description: Implements a simple grid-based Markov Decision Process (MDP) to visualize value functions and policies. Includes simulations under both random and optimal policies using iterative evaluation and value iteration. Ideal for understanding how rewards and transitions shape decision-making in finite environments.
Description: Solves a 4x4 gridworld task using policy evaluation, improvement, and iteration. Demonstrates how an agent can learn the optimal policy by computing value functions iteratively. Great for understanding dynamic programming methods in episodic, undiscounted environments.
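The policy evaluation step described above can be sketched as follows for the equiprobable random policy. This assumes the textbook layout (terminal states in two opposite corners, reward of -1 per step, no discounting); the repository's own layout and names may differ.

```python
def evaluate_random_policy(theta=1e-6):
    """Iterative policy evaluation for a 4x4 gridworld under a uniform random policy."""
    size = 4
    terminals = {(0, 0), (size - 1, size - 1)}   # assumed terminal corners
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    v = [[0.0] * size for _ in range(size)]
    while True:
        delta = 0.0
        for r in range(size):
            for c in range(size):
                if (r, c) in terminals:
                    continue
                new_v = 0.0
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < size and 0 <= nc < size):
                        nr, nc = r, c            # moving off the grid leaves the state unchanged
                    new_v += 0.25 * (-1.0 + v[nr][nc])
                delta = max(delta, abs(new_v - v[r][c]))
                v[r][c] = new_v                  # in-place sweep
        if delta < theta:
            return v

for row in evaluate_random_policy():
    print([round(x, 1) for x in row])
```

Policy improvement then amounts to acting greedily with respect to these values in each state.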
Description: Solves an episodic betting task using value iteration to find the optimal policy. Demonstrates how a gambler can maximize returns by betting strategically in a biased coin-flip game. Highlights stochasticity, policy convergence, and the balance between risk and reward.
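A minimal sketch of value iteration for such a betting task is shown below. The win probability of 0.4 and the goal of 100 are assumptions taken from the textbook version of the problem; the only reward is encoded as a value of 1 at the goal state.

```python
def gambler_value_iteration(p_heads=0.4, goal=100, theta=1e-10):
    """Value iteration over capital levels 1..goal-1."""
    v = [0.0] * (goal + 1)
    v[goal] = 1.0   # reaching the goal is worth 1; going broke is worth 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # A stake cannot exceed the current capital or the amount still needed.
            best = max(
                p_heads * v[s + bet] + (1 - p_heads) * v[s - bet]
                for bet in range(1, min(s, goal - s) + 1)
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v

values = gambler_value_iteration()
print(round(values[25], 4), round(values[50], 4), round(values[75], 4))
```

With a coin biased against the gambler, the value at capital 50 equals the win probability, because staking everything on a single flip is optimal there.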
Description: Models Blackjack as an episodic MDP and applies Monte Carlo techniques for policy evaluation, improvement, and off-policy learning. Demonstrates learning optimal strategies through Exploring Starts and evaluates sampling efficiency with ordinary vs. weighted importance sampling.
Description: Illustrates instability in ordinary importance sampling due to infinite variance in off-policy Monte Carlo estimation. Demonstrates how weighted importance sampling provides more stable convergence when estimating value functions under stochastic looping behavior.
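The contrast between the two estimators can be sketched on a one-state looping MDP. The setup below is an assumption based on the textbook example: the target policy always chooses left, the behavior policy chooses left or right with equal probability, left loops back with probability 0.9 or terminates with reward +1, and right terminates with reward 0.

```python
import random

def sample_episode(rng):
    """Return (importance ratio, return) for one episode under the behavior policy."""
    rho = 1.0
    while True:
        if rng.random() < 0.5:      # behavior picks right: terminate with reward 0
            return 0.0, 0.0         # the target policy never picks right, so the ratio is 0
        rho *= 1.0 / 0.5            # behavior picks left: pi(left) = 1, b(left) = 0.5
        if rng.random() < 0.1:      # left terminates with probability 0.1 and reward +1
            return rho, 1.0
        # otherwise the episode loops back to the same state with reward 0

def estimate(num_episodes, seed=0):
    """Ordinary and weighted importance-sampling estimates of the state value."""
    rng = random.Random(seed)
    sum_rho_g = 0.0
    sum_rho = 0.0
    for _ in range(num_episodes):
        rho, g = sample_episode(rng)
        sum_rho_g += rho * g
        sum_rho += rho
    ordinary = sum_rho_g / num_episodes                       # divide by the episode count
    weighted = sum_rho_g / sum_rho if sum_rho > 0 else 0.0    # divide by the sum of ratios
    return ordinary, weighted

for n in (100, 10_000, 100_000):
    print(n, estimate(n))
```

Every episode with a nonzero ratio has a return of 1, so the weighted estimate is exactly 1 as soon as one such episode is seen, while the ordinary estimate keeps jumping whenever a long looping episode produces a very large ratio.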
Description: This project compares TD(0) and Monte Carlo prediction methods in a simple Markov Reward Process. It highlights the trade-offs between bootstrapping (TD) and full-return updates (MC) in estimating state values from experience.
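To make the difference between the two targets concrete, here is a small sketch on a 5-state random walk. The environment (start in the middle, reward 1 only on the right exit, no discounting) and the step size are assumptions for illustration and may not match the repository's setup.

```python
import random

TRUE_VALUES = [i / 6 for i in range(1, 6)]   # exact values for the five non-terminal states

def run_episode(rng):
    """Return the visited states and the terminal reward (1 on the right exit, else 0)."""
    s, states = 2, []
    while 0 <= s <= 4:
        states.append(s)
        s += rng.choice((-1, 1))
    return states, (1.0 if s == 5 else 0.0)

def learn(method, episodes=2000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    v = [0.5] * 5
    for _ in range(episodes):
        states, reward = run_episode(rng)
        for i, s in enumerate(states):
            if method == "td":
                # TD(0): bootstrap from the current estimate of the next state.
                is_last = i == len(states) - 1
                target = reward if is_last else v[states[i + 1]]
            else:
                # Constant-alpha, every-visit MC: the target is the full return.
                target = reward
            v[s] += alpha * (target - v[s])
    return v

def rms_error(v):
    return (sum((a - b) ** 2 for a, b in zip(v, TRUE_VALUES)) / len(v)) ** 0.5

print("TD(0) RMS error:", round(rms_error(learn("td")), 3))
print("MC    RMS error:", round(rms_error(learn("mc")), 3))
```

Because intermediate rewards are zero here, the full return from any visited state is simply the terminal reward, which keeps the Monte Carlo branch to a single line.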
Description: This project uses the on-policy TD control algorithm SARSA to solve the Windy Gridworld environment. The agent learns to reach a goal efficiently despite wind disturbances. It demonstrates how exploration and temporal-difference learning can produce near-optimal paths in dynamic, stochastic environments.
Description: This project compares SARSA (on-policy) and Q-learning (off-policy) in the Cliff Walking gridworld. It shows how SARSA learns a safer path by accounting for exploration, while Q-learning aims for optimality but risks falling into the cliff. The experiment demonstrates the trade-off between safety during learning and final policy performance under ε-greedy exploration.
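The behavioral difference described above comes down to one term in the update target. The sketch below uses a hypothetical two-state Q-table (nested lists indexed as `q[state][action]`) rather than the full Cliff Walking grid.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy: the target uses the action actually taken next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])

def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy: the target uses the greedy action in the next state."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])

# Hypothetical values: in state 1, action 0 steps off the cliff and action 1 is safe.
q_sarsa = [[0.0, 0.0], [-100.0, -1.0]]
q_qlearn = [[0.0, 0.0], [-100.0, -1.0]]

# The agent moves from state 0 to state 1, then explores into the cliff (a_next = 0).
sarsa_update(q_sarsa, s=0, a=0, r=-1.0, s_next=1, a_next=0)
q_learning_update(q_qlearn, s=0, a=0, r=-1.0, s_next=1)

print(q_sarsa[0][0])    # -50.5: the exploratory fall is charged to the preceding step
print(q_qlearn[0][0])   # -1.0: only the greedy continuation counts
```

SARSA's estimate absorbs the cost of exploratory mistakes, which is why it tends to prefer paths farther from the cliff while ε stays above zero.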
Description: Explores the phenomenon of maximization bias in reinforcement learning. Demonstrates how estimating the maximum expected value from samples can introduce a positive bias, and how Double Q-learning mitigates this issue. The project highlights the difference between overestimation and unbiased value estimation in action-value methods.
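The bias itself can be shown without any learning loop. In this sketch, every action has the same true value (the -0.1 mean and unit noise are assumptions for the example), so taking the maximum over noisy estimates overestimates it, while the double estimator selects with one set of samples and evaluates with another.

```python
import random

def estimate_bias(num_actions=10, true_mean=-0.1, trials=10000, seed=0):
    """Average single-estimator and double-estimator values over many trials."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent noisy estimates of every action value.
        est_a = [rng.gauss(true_mean, 1.0) for _ in range(num_actions)]
        est_b = [rng.gauss(true_mean, 1.0) for _ in range(num_actions)]
        # Single estimator: select and evaluate with the same samples.
        single_total += max(est_a)
        # Double estimator: select with A, evaluate with B.
        best = max(range(num_actions), key=lambda i: est_a[i])
        double_total += est_b[best]
    return single_total / trials, double_total / trials

single, double = estimate_bias()
print("true max value:   -0.1")
print("single estimator:", round(single, 3))
print("double estimator:", round(double, 3))
```

Double Q-learning applies the same idea to action values by keeping two tables and using one to choose the maximizing action and the other to evaluate it.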
Description: Implements n-step TD prediction on the 1000-state random walk example. The project compares TD(0), Monte Carlo, and intermediate n-step methods, showing how an intermediate n can learn faster than either extreme by balancing the bias of bootstrapping against the variance of full returns. Visualizations include learning curves for different values of n.
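The quantity that ties these methods together is the n-step return. A minimal sketch follows, assuming the indexing convention `rewards[k]` for the reward received after step k and `values[k]` for the current estimate of the state at step k; both names are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return from time t for an episode that terminates after len(rewards) steps."""
    T = len(rewards)
    end = min(t + n, T)
    # Sum of up to n discounted rewards.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        # Bootstrap from the estimate of the state reached after n steps.
        g += gamma ** n * values[end]
    return g

rewards = [0.0, 0.0, 1.0]     # a three-step episode ending with reward 1
values = [0.5, 0.6, 0.7]      # current estimates for the three visited states

print(n_step_return(rewards, values, t=0, n=1))   # 0.6, the TD(0) target
print(n_step_return(rewards, values, t=0, n=2))   # 0.7
print(n_step_return(rewards, values, t=0, n=3))   # 1.0, the Monte Carlo return
```

Any n at least as large as the remaining episode length gives the Monte Carlo return, so TD(0) and Monte Carlo are the two ends of the same family.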
Description: Demonstrates model-based RL using simple maze environments. Implements Dyna-Q, prioritized sweeping, and basic planning algorithms. Highlights how simulated experience (planning) combined with real experience accelerates learning optimal paths in stochastic environments.
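The interplay of real and simulated experience can be sketched with tabular Dyna-Q. The code below assumes a deterministic environment exposed as a hypothetical `env_step(state, action)` function returning `(next_state, reward, done)`, and uses a 6-cell corridor in place of a maze.

```python
import random

def dyna_q(env_step, start, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {}       # (state, action) -> estimated value, missing entries count as 0
    model = {}   # (state, action) -> (next_state, reward, done), deterministic model

    def value(s, a):
        return q.get((s, a), 0.0)

    def greedy(s):
        best = max(value(s, a) for a in actions)
        return rng.choice([a for a in actions if value(s, a) == best])   # random tie-break

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(value(s2, b) for b in actions)
        q[(s, a)] = value(s, a) + alpha * (target - value(s, a))

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = env_step(s, a)
            update(s, a, r, s2, done)            # learn from real experience
            model[(s, a)] = (s2, r, done)        # remember the observed transition
            for _ in range(planning_steps):      # learn from simulated experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

def corridor_step(s, a):
    """Hypothetical 6-cell corridor: actions are -1 or +1, reward 1 on reaching cell 5."""
    s2 = min(max(s + a, 0), 5)
    done = s2 == 5
    return s2, (1.0 if done else 0.0), done

q = dyna_q(corridor_step, start=0, actions=[-1, 1])
print({s: round(q.get((s, 1), 0.0), 3) for s in range(5)})
```

Setting `planning_steps=0` reduces this to plain one-step Q-learning, which makes the speed-up from planning easy to measure. Prioritized sweeping would replace the uniform choice of replayed pairs with a priority queue ordered by the size of the expected update.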
Description: Compares Monte Carlo and temporal-difference (TD) updates in a controlled setting. The project illustrates differences in bias, variance, and learning speed between the two approaches, using simple Markov Reward Processes for demonstration.
Description: Implements trajectory sampling methods to approximate value functions in episodic tasks. Demonstrates how sampled trajectories can be used for policy evaluation, and explores the trade-offs between sample efficiency and variance in predictions.
Description: Extends the 1000-state random walk example using linear function approximation. Implements polynomial and Fourier bases to represent value functions. Highlights challenges in generalization and feature design, and compares learning curves for different basis functions.
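Both bases map a normalized state to a feature vector that is then combined linearly with a weight vector. The sketch below assumes states are numbered 1 to 1000 and scaled to [0, 1]; the function names and the scaling convention are illustrative assumptions.

```python
import math

def normalize(state, num_states=1000):
    """Map states 1..num_states onto the interval [0, 1]."""
    return (state - 1) / (num_states - 1)

def polynomial_features(s, order):
    """Features 1, s, s^2, ..., s^order for a scalar s in [0, 1]."""
    return [s ** i for i in range(order + 1)]

def fourier_features(s, order):
    """Features cos(i * pi * s) for i = 0..order."""
    return [math.cos(math.pi * i * s) for i in range(order + 1)]

def linear_value(weights, features):
    """Linear value estimate: the dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

s = normalize(500)
print([round(x, 3) for x in polynomial_features(s, 3)])
print([round(x, 3) for x in fourier_features(s, 3)])
```

With either basis, the gradient Monte Carlo or semi-gradient TD update moves the weights along the feature vector, so the choice of basis determines how an update at one state generalizes to others.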
Description: Demonstrates coarse coding (tile coding) for linear function approximation on a 1D square-wave function. Explores the effect of feature width on generalization and learning. Wider features produce smooth generalization, whereas narrow features produce localized, bumpy estimates. The project highlights the role of receptive field design in function approximation.
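The effect of feature width can be sketched with overlapping intervals on a line. The domain [0, 2], the number of features, and the step-size scaling by the number of active features are assumptions made for this example.

```python
def make_intervals(num_features, width, low=0.0, high=2.0):
    """Evenly spaced interval features of the given width with centers on [low, high]."""
    step = (high - low) / (num_features - 1)
    centers = [low + i * step for i in range(num_features)]
    return [(c - width / 2, c + width / 2) for c in centers]

def active_features(x, intervals):
    """Indices of the intervals that contain x (their binary features equal 1)."""
    return [i for i, (lo, hi) in enumerate(intervals) if lo <= x <= hi]

def predict(x, weights, intervals):
    return sum(weights[i] for i in active_features(x, intervals))

def update(x, target, weights, intervals, alpha=0.2):
    """One gradient step; the step size is divided by the number of active features."""
    active = active_features(x, intervals)
    if not active:
        return
    error = target - predict(x, weights, intervals)
    for i in active:
        weights[i] += (alpha / len(active)) * error

narrow = make_intervals(50, width=0.1)
wide = make_intervals(50, width=1.0)
print(len(active_features(1.0, narrow)), "active narrow features at x = 1.0")
print(len(active_features(1.0, wide)), "active wide features at x = 1.0")
```

A single update at one point changes the prediction everywhere the active intervals reach, so wide intervals spread each update over a broad region and narrow ones keep it local.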
Description: Solves the classic Mountain Car control task using semi-gradient SARSA with tile-coded function approximation. The experiment illustrates how an agent learns to escape a gravity-driven valley by building momentum. Highlights sparse rewards, continuous state spaces, feature construction, and sensitivity to step-size.
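A minimal sketch of the update at the core of this approach is given below. The simplified grid tile coder is an assumption for illustration (the repository may use a hashing tile coder instead); the position and velocity bounds are the standard Mountain Car ranges.

```python
def active_tiles(position, velocity, action, num_tilings=8, tiles_per_dim=8, num_actions=3):
    """Index of the active tile in each tiling for a (state, action) pair."""
    # Scale both state variables to [0, 1] using the standard Mountain Car bounds.
    p = (position + 1.2) / (0.5 + 1.2)
    v = (velocity + 0.07) / (0.07 + 0.07)
    side = tiles_per_dim + 1                  # one extra tile per dimension absorbs the offset
    indices = []
    for t in range(num_tilings):
        offset = t / num_tilings              # each tiling is shifted by a fraction of a tile
        pi = min(int(p * tiles_per_dim + offset), side - 1)
        vi = min(int(v * tiles_per_dim + offset), side - 1)
        indices.append(((t * side + pi) * side + vi) * num_actions + action)
    return indices

def q_value(weights, tiles):
    """Linear action value: the sum of the weights of the active tiles."""
    return sum(weights[i] for i in tiles)

def semi_gradient_sarsa_update(weights, tiles, reward, next_tiles, alpha, gamma=1.0, done=False):
    """One semi-gradient SARSA step; returns the TD error."""
    target = reward if done else reward + gamma * q_value(weights, next_tiles)
    delta = target - q_value(weights, tiles)
    for i in tiles:
        weights[i] += alpha * delta           # the gradient is 1 for each active binary feature
    return delta

weights = [0.0] * (8 * 9 * 9 * 3)
tiles = active_tiles(-0.5, 0.0, action=1)
next_tiles = active_tiles(-0.49, 0.01, action=2)
delta = semi_gradient_sarsa_update(weights, tiles, -1.0, next_tiles, alpha=0.1 / 8)
print(delta, round(q_value(weights, tiles), 3))
```

Dividing the step size by the number of tilings keeps the effective step on the prediction equal to the nominal value, which is one reason results are sensitive to how the step size is specified.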
Description: Implements the Access-Control Queuing Task, a stochastic resource-management problem from Sutton & Barto. Compares REINFORCE, Actor–Critic, and differential reward formulations to evaluate their handling of delayed feedback and high-variance policy gradients. Demonstrates optimal acceptance/rejection behavior in high-load environments.
Description: Reproduces several counterexamples from Chapter 12 of Sutton & Barto demonstrating the pitfalls of eligibility traces in TD control. Compares accumulating, replacing, dutch, clearing, and true-online traces on tile-coded Mountain Car. Shows where traditional traces diverge or behave unexpectedly, while true-online variants remain stable.
Description: Analyzes eligibility traces in prediction tasks using the 19-state Random Walk. Compares TD(λ), true-online TD(λ), and off-line λ-return methods. Demonstrates how λ influences bias–variance and reproduces key theoretical curves from the literature.
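The target used by the off-line λ-return method can be sketched directly from its definition as a weighted average of n-step returns. The indexing convention (`rewards[k]` is the reward after step k, `values[k]` the estimate for the state at step k) is an assumption of this example.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Lambda-return from time t for an episode that terminates after len(rewards) steps."""
    T = len(rewards)

    def n_step(n):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:
            g += gamma ** n * values[end]     # bootstrap if the episode has not ended
        return g

    # Weight (1 - lam) * lam^(n-1) on each n-step return that bootstraps ...
    total = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    # ... and the remaining weight lam^(T-t-1) on the full Monte Carlo return.
    return total + lam ** (T - t - 1) * n_step(T - t)

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]

print(lambda_return(rewards, values, t=0, lam=0.0))   # 0.6, the one-step TD target
print(lambda_return(rewards, values, t=0, lam=0.5))   # 0.725
print(lambda_return(rewards, values, t=0, lam=1.0))   # 1.0, the Monte Carlo return
```

Sweeping `lam` from 0 to 1 moves the target smoothly from the one-step TD target to the Monte Carlo return, which is the bias-variance dial described above.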
Description: Extends the Mountain Car task using SARSA(λ) with different eligibility traces: accumulating, replacing, dutch, and true-online. Examines their impact on credit assignment, sensitivity to step-size, and stability of learning in continuous state spaces with tile coding.
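For binary features such as tile coding, three of the trace variants differ only in how the active components are handled after decay. The sketch below is a simplified illustration with hypothetical names; it omits the additional weight correction that true-online methods apply on top of the dutch trace.

```python
def update_trace(z, active, kind, alpha=0.1, gamma=1.0, lam=0.9):
    """Decay the trace vector z, then account for the currently active binary features."""
    decayed = [gamma * lam * zi for zi in z]
    if kind == "accumulating":
        for i in active:
            decayed[i] += 1.0                       # traces keep growing on repeated visits
    elif kind == "replacing":
        for i in active:
            decayed[i] = 1.0                        # traces are reset to 1 on each visit
    elif kind == "dutch":
        overlap = sum(decayed[i] for i in active)   # inner product of decayed trace and features
        for i in active:
            decayed[i] += 1.0 - alpha * overlap
    else:
        raise ValueError(f"unknown trace kind: {kind}")
    return decayed

z = [0.0, 1.0, 0.0]
for kind in ("accumulating", "replacing", "dutch"):
    print(kind, update_trace(z, active=[1], kind=kind, alpha=0.5, lam=0.5))
```

On a revisited feature the dutch trace lands between the replacing and accumulating values, with the step size controlling how close it is to either one.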
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
🔗 Read the full book (PDF)
This repository is intended as a companion to theoretical learning, providing hands-on experience with classical reinforcement learning algorithms and concepts. Each project serves as a stepping stone toward mastering more advanced topics in AI and autonomous decision-making.