NPUA Reinforcement Learning Course Projects

Overview

This repository collects projects developed during the Reinforcement Learning (RL) course at the National Polytechnic University of Armenia (NPUA). The projects pair theory with hands-on implementation, aiming to build a solid understanding of decision-making systems. The work follows the methods presented in Sutton and Barto (2018), listed under Reference.

What is Reinforcement Learning?

Reinforcement Learning (RL) studies how agents learn to make decisions by interacting with an environment. The agent chooses actions, observes the outcomes, and adapts its strategy to maximize cumulative reward. Unlike traditional supervised learning, RL relies on trial and error rather than labeled data. It is widely applied in robotics, games, and autonomous systems where sequential decision-making is crucial.

Projects

Project 1: Tic-Tac-Toe

Description: Demonstrates how an agent learns to play optimally via self-play and tabular value updates, showcasing core RL concepts like value estimation, policy improvement, and strategic convergence.

Project 2: 10-Armed Testbed

Description: Simulates a multi-armed bandit problem to compare different action-selection strategies such as ε-greedy, UCB, and gradient methods. The project visualizes average rewards and optimal action rates over time to build intuition around exploration-exploitation dynamics.
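
As a rough illustration of the ε-greedy strategy mentioned above, the sketch below runs one agent on a 10-armed Gaussian bandit using sample-average estimates. The function name `run_epsilon_greedy` and its parameters are hypothetical and not taken from the project code.

```python
import numpy as np

def run_epsilon_greedy(epsilon, steps=1000, k=10, seed=0):
    """Run one epsilon-greedy agent on a k-armed Gaussian bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, size=k)    # hidden true action values
    q_est = np.zeros(k)                       # sample-average estimates
    counts = np.zeros(k, dtype=int)
    rewards = np.zeros(steps)
    optimal = np.zeros(steps, dtype=bool)
    best = int(np.argmax(q_true))
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore: uniform random arm
        else:
            a = int(np.argmax(q_est))         # exploit: ties go to lowest index
        r = rng.normal(q_true[a], 1.0)        # noisy reward around the true value
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]    # incremental sample mean
        rewards[t] = r
        optimal[t] = (a == best)
    return rewards, optimal
```

Averaging `rewards` and `optimal` over many independent seeds gives the kind of average-reward and optimal-action curves the project visualizes.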

Project 3: Gridworld - MDP

Description: Implements a simple grid-based Markov Decision Process (MDP) to visualize value functions and policies. Includes simulations under both random and optimal policies using iterative evaluation and value iteration. Ideal for understanding how rewards and transitions shape decision-making in finite environments.

Project 4: Gridworld - Dynamic Programming

Description: Solves a 4x4 gridworld task using policy evaluation, improvement, and iteration. Demonstrates how an agent can learn the optimal policy by computing value functions iteratively. Great for understanding dynamic programming methods in episodic, undiscounted environments.
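
The policy-evaluation step can be sketched as follows, assuming the textbook 4x4 layout: terminal states in two opposite corners, a reward of -1 per move, no discounting, and an equiprobable random policy. The function name and the stopping threshold are illustrative choices, not the project's own.

```python
import numpy as np

def evaluate_random_policy(theta=1e-6):
    """In-place iterative policy evaluation for the equiprobable random policy."""
    n = 4
    v = np.zeros((n, n))
    terminals = {(0, 0), (n - 1, n - 1)}           # assumed terminal corners
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
    while True:
        delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in terminals:
                    continue
                new_v = 0.0
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < n and 0 <= nj < n):
                        ni, nj = i, j              # off-grid moves leave the state unchanged
                    new_v += 0.25 * (-1.0 + v[ni, nj])
                delta = max(delta, abs(new_v - v[i, j]))
                v[i, j] = new_v
        if delta < theta:
            return v
```

Under these assumptions the values converge to the expected negative number of steps to termination, for example about -14 next to a terminal corner and about -22 in the far corners.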

Project 5: Gambler’s Problem – Value Iteration

Description: Solves an episodic betting task using value iteration to find the optimal policy. Demonstrates how a gambler can maximize returns by betting strategically in a biased coin-flip game. Highlights stochasticity, policy convergence, and the balance between risk and reward.
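
A minimal value-iteration sketch for this task, assuming a goal of 100, a heads probability of 0.4, and a reward of +1 only on reaching the goal (encoded here as a fixed terminal value of 1). The names and defaults are illustrative.

```python
import numpy as np

def gambler_value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (illustrative sketch)."""
    v = np.zeros(goal + 1)
    v[goal] = 1.0                      # reaching the goal is worth +1
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Legal stakes: at least 1, at most what is needed to hit the goal.
            stakes = np.arange(1, min(s, goal - s) + 1)
            returns = p_heads * v[s + stakes] + (1 - p_heads) * v[s - stakes]
            best = returns.max()
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```

With these settings `v[s]` is the probability of reaching the goal from capital `s` under optimal play; for instance, `v[50]` comes out at 0.4, the chance of winning a single all-in bet.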

Project 6: Blackjack - Monte Carlo Methods

Description: Models Blackjack as an episodic MDP and applies Monte Carlo techniques for policy evaluation, improvement, and off-policy learning. Demonstrates learning optimal strategies through Exploring Starts and evaluates sampling efficiency with ordinary vs. weighted importance sampling.

Project 7: Infinite Variance

Description: Illustrates instability in ordinary importance sampling due to infinite variance in off-policy Monte Carlo estimation. Demonstrates how weighted importance sampling provides more stable convergence when estimating value functions under stochastic looping behavior.
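
The contrast can be reproduced with a small simulation. The sketch assumes the textbook one-state setup: `right` always terminates with reward 0, `left` loops back with probability 0.9 or terminates with reward +1 with probability 0.1, the target policy always picks `left`, and the behavior policy picks either action with probability 0.5. Function names are hypothetical.

```python
import numpy as np

def simulate_episode(rng):
    """Return (importance ratio, return) for one behavior-policy episode."""
    ratio = 1.0
    while True:
        if rng.random() < 0.5:     # behavior picks 'right'
            return 0.0, 0.0        # target never picks 'right', so the ratio is 0
        ratio *= 2.0               # pi(left) / b(left) = 1 / 0.5
        if rng.random() < 0.1:     # 'left' terminates with reward +1
            return ratio, 1.0

def estimate(n_episodes=10000, seed=0):
    """Ordinary and weighted importance-sampling estimates of the start-state value."""
    rng = np.random.default_rng(seed)
    samples = [simulate_episode(rng) for _ in range(n_episodes)]
    ratios = np.array([rho for rho, _ in samples])
    returns = np.array([g for _, g in samples])
    ordinary = np.sum(ratios * returns) / n_episodes
    total = np.sum(ratios)
    weighted = np.sum(ratios * returns) / total if total > 0 else 0.0
    return ordinary, weighted
```

Every episode with a nonzero ratio has return 1, so the weighted estimate equals 1 as soon as one such episode is observed, while the ordinary estimate keeps fluctuating because the ratios are unbounded.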

Project 8: Random Walk — TD(0) vs Monte Carlo

Description: This project compares TD(0) and Monte Carlo prediction methods in a simple Markov Reward Process. It highlights the trade-offs between bootstrapping (TD) and full-return updates (MC) in estimating state values from experience.
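
A minimal TD(0) sketch, assuming the five-state random walk with terminal states at both ends, a reward of +1 only on exiting to the right, and no discounting. The step size and episode count are arbitrary illustrative values.

```python
import numpy as np

def td0_random_walk(episodes=2000, alpha=0.02, seed=0):
    """Tabular TD(0) prediction on a 5-state random walk (illustrative)."""
    rng = np.random.default_rng(seed)
    # Indices 0 and 6 are terminal; 1..5 are the non-terminal states.
    v = np.full(7, 0.5)
    v[0] = v[6] = 0.0
    for _ in range(episodes):
        s = 3                                   # start in the middle state
        while s not in (0, 6):
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == 6 else 0.0
            # Bootstrapped update toward r + V(s'), with gamma = 1.
            v[s] += alpha * (r + v[s_next] - v[s])
            s = s_next
    return v[1:6]
```

The true values for this walk are 1/6 through 5/6. A Monte Carlo variant would instead wait until the episode ends and move each visited state's value toward the full return.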

Project 9: Windy Gridworld — SARSA

Description: This project uses the on-policy TD control algorithm SARSA to solve the Windy Gridworld environment. The agent learns to reach a goal efficiently despite wind disturbances. It demonstrates how exploration and temporal-difference learning can produce near-optimal paths in dynamic, stochastic environments.
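
For orientation, here is a sketch of the environment's transition function, assuming the textbook layout: a 7x10 grid, deterministic upward wind per column, start at (3, 0), goal at (3, 7), and a reward of -1 per step. Any stochastic-wind variant the project may use is not shown, and all names are hypothetical.

```python
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]      # upward push for each column (assumed)
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def windy_step(state, action):
    """Apply one action; returns (next_state, reward, done)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    # The wind of the column the agent is leaving pushes it upward (lower row index).
    new_row = row + d_row - WIND[col]
    new_col = col + d_col
    # Clip to the grid boundaries.
    new_row = min(max(new_row, 0), ROWS - 1)
    new_col = min(max(new_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1.0, next_state == GOAL
```

A SARSA agent would call `windy_step` once per time step, choosing actions ε-greedily from its current action-value table.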

Project 10: Cliff Walking

Description: This project compares SARSA (on-policy) and Q-learning (off-policy) in the Cliff Walking gridworld. SARSA learns a safer path away from the edge because its updates account for exploratory actions, while Q-learning learns the optimal path along the edge but falls off the cliff more often while exploring. The experiment demonstrates the trade-off between online performance during learning and the quality of the final greedy policy under ε-greedy exploration.
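
The two algorithms differ only in their update target, which the sketch below makes explicit. It assumes a tabular `q` array indexed as `q[state, action]` with all-zero rows for terminal states; the helper names are illustrative.

```python
import numpy as np

def sarsa_target(q, reward, next_state, next_action, gamma=1.0):
    """On-policy target: uses the action actually selected in the next state."""
    return reward + gamma * q[next_state, next_action]

def q_learning_target(q, reward, next_state, gamma=1.0):
    """Off-policy target: uses the greedy action in the next state."""
    return reward + gamma * np.max(q[next_state])

def td_update(q, state, action, target, alpha=0.5):
    """Move Q(state, action) a fraction alpha toward the target (in place)."""
    q[state, action] += alpha * (target - q[state, action])
```

For example, with `q = np.array([[0., 0.], [-5., -1.]])` and a reward of -1, the SARSA target for an exploratory next action 0 in state 1 is -6, while the Q-learning target is -2. That gap is why SARSA's estimates near the cliff reflect the cost of exploration.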

Project 11: Maximization Bias

Description: Explores the phenomenon of maximization bias in reinforcement learning. Demonstrates how estimating the maximum expected value from samples can introduce a positive bias, and how Double Q-learning mitigates this issue. The project highlights the difference between overestimation and unbiased value estimation in action-value methods.
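
The bias can be seen without any MDP at all. The sketch below assumes ten actions whose true values are all zero and compares the maximum of sample means (single estimator) with a double estimator that selects the action on one half of the samples and evaluates it on the other half. Names and sample sizes are illustrative.

```python
import numpy as np

def compare_estimators(n_actions=10, n_samples=10, n_runs=5000, seed=0):
    """Average single-estimator and double-estimator values over many runs."""
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    single, double = [], []
    for _ in range(n_runs):
        # Every action has true value 0; samples are pure noise.
        samples = rng.normal(0.0, 1.0, size=(n_actions, n_samples))
        # Single estimator: max over sample means (biased upward).
        single.append(samples.mean(axis=1).max())
        # Double estimator: pick the argmax on set A, evaluate it on set B.
        means_a = samples[:, :half].mean(axis=1)
        means_b = samples[:, half:].mean(axis=1)
        double.append(means_b[np.argmax(means_a)])
    return float(np.mean(single)), float(np.mean(double))
```

The true maximum is 0, yet the single estimator averages well above it, while the double estimator stays close to 0. Double Q-learning applies the same decoupling of selection and evaluation to action values.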

Project 12: Random Walk — N-step TD

Description: Implements n-step TD prediction on the 1000-state random walk example. The project compares TD(0), Monte Carlo, and intermediate n-step methods, showing how the choice of n trades bootstrapping bias against the variance of longer returns, and that intermediate values of n often learn fastest. Visualizations include learning curves for different values of n.
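
The quantity that n-step TD moves each estimate toward is the n-step return. A small sketch, assuming `rewards[k]` holds the reward received after leaving state k and `values[k]` holds the current estimate for state k, with a terminal value of 0 at the end; the function name is hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return from time t: up to n rewards, then a bootstrapped value."""
    T = len(rewards)                 # episode length
    end = min(t + n, T)
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if end < T:
        # Episode has not ended within n steps: bootstrap from the estimate.
        g += gamma ** n * values[end]
    return g
```

With `n=1` this is the TD(0) target, and with `n` at least the episode length it reduces to the full Monte Carlo return.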

Project 13: Mazes — Planning and Learning

Description: Demonstrates model-based RL using simple maze environments. Implements Dyna-Q, prioritized sweeping, and basic planning algorithms. Highlights how simulated experience (planning) combined with real experience accelerates learning optimal paths in stochastic environments.
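
One Dyna-Q time step can be sketched as below: a direct Q-learning update, a model update, then a few planning updates replayed from the model. The sketch assumes a deterministic table model that stores the last observed outcome per state-action pair; the function name, the dictionary layout, and the defaults are illustrative.

```python
import random
from collections import defaultdict

def dyna_q_step(q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n_planning=5, rng=random):
    """Direct RL update, model learning, then n_planning simulated updates."""
    def q_update(state, action, reward, next_state):
        best_next = max(q[(next_state, b)] for b in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    q_update(s, a, r, s_next)              # learn from real experience
    model[(s, a)] = (r, s_next)            # remember the observed outcome
    for _ in range(n_planning):
        # Replay a previously seen state-action pair chosen at random.
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)
```

Here `q` is expected to be a `defaultdict(float)` keyed by `(state, action)` and `model` a plain dict. Setting `n_planning=0` recovers ordinary one-step Q-learning.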

Project 14: Updates Comparison — Monte Carlo vs TD

Description: Compares Monte Carlo and temporal-difference (TD) updates in a controlled setting. The project illustrates differences in bias, variance, and learning speed between the two approaches, using simple Markov Reward Processes for demonstration.

Project 15: Trajectory Sampling

Description: Implements trajectory sampling methods to approximate value functions in episodic tasks. Demonstrates how sampled trajectories can be used for policy evaluation, and explores the trade-offs between sample efficiency and variance in predictions.

Project 16: Random Walk — Function Approximation

Description: Extends the 1000-state random walk example using linear function approximation. Implements polynomial and Fourier bases to represent value functions. Highlights challenges in generalization and feature design, and compares learning curves for different basis functions.
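
The two feature constructions can be sketched for a scalar state rescaled to [0, 1]. The rescaling and the function names are assumptions made for illustration.

```python
import numpy as np

def polynomial_features(s, order):
    """Features 1, s, s^2, ..., s^order for a state s in [0, 1]."""
    return np.array([s ** i for i in range(order + 1)])

def fourier_features(s, order):
    """Features cos(i * pi * s) for i = 0..order, for a state s in [0, 1]."""
    return np.cos(np.pi * np.arange(order + 1) * s)

def linear_value(weights, features):
    """Linear value estimate: inner product of weights and features."""
    return float(np.dot(weights, features))
```

For the 1000-state walk one might map state index `i` to `s = i / 1000` before computing features. The value estimate is then `linear_value(w, fourier_features(s, order))`, with `w` learned by gradient Monte Carlo or semi-gradient TD.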

Project 17: Coarse Coding

Description: Demonstrates coarse coding for linear function approximation on a 1D square-wave function, using overlapping interval features (tile coding is a related, grid-structured form of coarse coding). Explores the effect of feature width on generalization and learning. Wider features produce smooth generalization, whereas narrow features produce localized, bumpy estimates. The project highlights the role of receptive field design in function approximation.
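
A sketch of 1D coarse coding with interval features, assuming each feature is active when the input lies within half a width of its center. The step-size scaling by the number of active features is one common choice rather than necessarily the project's own, and all names are hypothetical.

```python
import numpy as np

def interval_features(x, centers, width):
    """Binary features: 1 where x falls inside an interval around each center."""
    return (np.abs(x - centers) <= width / 2).astype(float)

def coarse_update(weights, x, target, centers, width, alpha=0.2):
    """One gradient step of linear function approximation toward the target."""
    features = interval_features(x, centers, width)
    n_active = max(features.sum(), 1.0)
    error = target - np.dot(weights, features)
    # Divide the step size by the number of active features.
    weights += (alpha / n_active) * error * features
    return weights
```

With centers `np.linspace(0, 2, 5)`, a width of 1.0 activates three features at `x = 1.0`, while a width of 0.2 activates only one. This is the mechanism behind broad versus narrow generalization.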

Project 18: Mountain Car

Description: Solves the classic Mountain Car control task using semi-gradient SARSA with tile-coded function approximation. The experiment illustrates how an agent learns to escape a gravity-driven valley by building momentum. Highlights sparse rewards, continuous state spaces, feature construction, and sensitivity to step-size.
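
The environment dynamics can be sketched as follows, assuming the textbook formulation: actions in {-1, 0, +1}, position bounded to [-1.2, 0.5], velocity bounded to [-0.07, 0.07], and a reward of -1 per step. The function name is illustrative.

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of Mountain Car; action is -1 (reverse), 0 (coast) or +1 (forward)."""
    # Engine thrust plus the pull of gravity along the slope.
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    position = min(max(position, -1.2), 0.5)
    if position == -1.2:
        velocity = 0.0                 # hitting the left wall stops the car
    done = position >= 0.5             # goal at the right hilltop
    return position, velocity, -1.0, done
```

Because thrust (0.001) is weaker than gravity on the steep part of the slope, the agent has to back up first to gain momentum, which is what makes the task hard for greedy, short-sighted policies.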

Project 19: Access Control

Description: Implements the Access-Control Queuing Task, a stochastic resource-management problem from Sutton & Barto. Compares REINFORCE, Actor–Critic, and differential reward formulations to evaluate their handling of delayed feedback and high-variance policy gradients. Demonstrates optimal acceptance/rejection behavior in high-load environments.

Project 20: Counter Examples

Description: Reproduces several counterexamples from Chapter 12 demonstrating the pitfalls of eligibility traces in TD control. Compares accumulating, replacing, dutch, clearing, and true-online traces on tile-coded Mountain Car. Shows where traditional traces diverge or behave unexpectedly, while true-online variants remain stable.

Project 21: Random Walk ET

Description: Analyzes eligibility traces in prediction tasks using the 19-state Random Walk. Compares TD(λ), true-online TD(λ), and off-line λ-return methods. Demonstrates how λ influences bias–variance and reproduces key theoretical curves from the literature.
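
A tabular TD(λ) sketch with accumulating traces for a single recorded episode. It assumes `states` lists the visited state indices including the terminal one, `rewards` lists the reward after each transition, and the terminal state's value stays at 0. Names and defaults are illustrative.

```python
import numpy as np

def td_lambda_episode(v, states, rewards, alpha=0.1, gamma=1.0, lam=0.8):
    """Update the value table v in place over one episode using TD(lambda)."""
    z = np.zeros_like(v)                       # eligibility trace per state
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * v[s_next] - v[s]    # TD error
        z *= gamma * lam                       # decay all traces
        z[s] += 1.0                            # accumulate for the current state
        v += alpha * delta * z                 # credit recently visited states
    return v
```

With `lam=0` only the current state is updated, which is TD(0). As `lam` approaches 1 the updates spread further back along the trajectory and behave more like Monte Carlo.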

Project 22: Mountain Car ET

Description: Extends the Mountain Car task using SARSA(λ) with different eligibility traces: accumulating, replacing, dutch, and true-online. Examines their impact on credit assignment, sensitivity to step-size, and stability of learning in continuous state spaces with tile coding.
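
Three of the trace types differ only in how the trace vector `z` is updated from the current feature vector `x`. The sketch below assumes binary features, as produced by tile coding, and uses hypothetical function names; the true-online variant also changes the weight update and is not shown.

```python
import numpy as np

def accumulating_trace(z, x, gamma, lam):
    """Decay the trace, then add the active features."""
    return gamma * lam * z + x

def replacing_trace(z, x, gamma, lam):
    """Decay the trace, then reset active features to exactly 1."""
    z = gamma * lam * z
    z[x == 1] = 1.0
    return z

def dutch_trace(z, x, gamma, lam, alpha):
    """Decay the trace, then add the features scaled down by the existing overlap."""
    return gamma * lam * z + (1.0 - alpha * gamma * lam * np.dot(z, x)) * x
```

For repeatedly visited features, accumulating traces can grow beyond 1, replacing traces cap at 1, and dutch traces fall in between depending on the step size.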

Reference

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
🔗 Read the full book (PDF)

Educational Objective

This repository is intended as a companion to theoretical learning, providing hands-on experience with classical reinforcement learning algorithms and concepts. Each project serves as a stepping stone toward mastering more advanced topics in AI and autonomous decision-making.
