A comprehensive framework for exploring multi-agent reinforcement learning in a novel four-player Mancala variant using Q-learning with experience replay.
- Overview
- Getting Started
- Key Features
- Research Contributions
- Game Rules
- Experimental Framework
- Algorithm Details
- Results Summary
- Project Structure
- Usage Guide
- Dependencies
- Performance Notes
- Limitations and Future Work
- Statistical Validation
- Citation
- License
- Acknowledgments
- Contact
- Copyright Disclaimer
This project implements and evaluates a generalized four-player Mancala game environment designed for reinforcement learning research. The framework demonstrates how Q-learning agents can master a complex strategic board game, achieving win rates of up to 50.4% in single-agent scenarios and 62% (combined) in dual-agent configurations, significantly exceeding theoretical baselines.
The complete implementation is provided in a self-contained Jupyter notebook that can be executed independently. All methodological details, experimental design, and theoretical foundations are documented in the accompanying research report.
The MancalaFinal.ipynb notebook contains the complete, self-contained implementation of this project. This is the recommended way to explore and run the framework.
Features of MancalaFinal.ipynb:
- ✅ Fully self-contained code requiring no external scripts
- ✅ Complete implementation of all experiments
- ✅ Integrated visualization and analysis tools
- ✅ Step-by-step execution with detailed comments
- ✅ Ready to run in Google Colab or locally
For detailed methodological information, experimental design rationale, and theoretical background, please refer to the accompanying PDF report. The report provides:
- Comprehensive literature review and theoretical foundations
- Detailed explanation of algorithmic choices and hyperparameter selection
- Complete experimental methodology and validation procedures
- In-depth analysis of results and emergent behaviors
- Discussion of limitations and future research directions
- Open Google Colab
- Upload the MancalaFinal.ipynb notebook
- Run cells sequentially from top to bottom
- All dependencies will be installed automatically
💡 For faster preliminary results: Locate training parameters and reduce episode counts from 3000 to 300-500.
```bash
# Clone or download the repository
# Navigate to the project directory
jupyter notebook MancalaFinal.ipynb
```
- Google Colab: Web browser and Google account (no local setup required)
- Local Execution:
- Python 3.7 or higher
- Jupyter Notebook or JupyterLab
- 8GB RAM (recommended for full experiments)
- 4GB available disk space
- 🎮 Flexible Game Environment: Customizable board configurations with variable pit counts and stone distributions
- 🤖 Multi-Agent Support: Independent Q-learning agents with experience replay mechanisms
- 📊 Comprehensive Analysis: Parameter studies, positional analysis, and emergent behavior investigation
- 📈 Statistical Validation: Rigorous statistical testing with confidence intervals
- 🎨 Visualization Pipeline: Publication-quality plots and heatmaps for result analysis
- 📓 Self-Contained Implementation: Complete framework in a single executable notebook
The research identifies optimal board configurations where reinforcement learning agents significantly outperform theoretical expectations:
| Configuration | Win Rate | Baseline | Improvement |
|---|---|---|---|
| 6 pits, 2 stones | 50.4% | 25% random | +101.6% |
| Adjacent positioning (Players 1&2) | 62% combined | 50% theoretical | +24% |
- 🤝 Pseudo-cooperative dynamics: Independently trained agents develop complementary strategies without explicit coordination
- 🎯 Positional advantages: Early turn positions provide consistent strategic benefits
- 📉 Non-linear complexity effects: Performance varies non-monotonically with game parameters
- 🧩 Strategic pattern recognition: Agents learn to identify and exploit board states for optimal stone capture
```
Player 4 [R] P3 P2 P1 [R] Player 1
             P8 P7 P6 P5 P4 P3
             P1 P2 P3 P4 P5 P6
Player 3 [R] P1 P2 P3 [R] Player 2
```
Setup: Each player controls N pits initially containing M stones, plus one reservoir (scoring pit)
Turn Sequence: Players take turns counterclockwise, distributing stones one per pit
Special Rules:
- ❌ Skip opponents' reservoirs when distributing stones
- ➕ Land in your reservoir → earn an extra turn
- 🎯 Land in an empty pit on your side → capture stones from the opposite player's corresponding pit
Victory Condition: Game ends when any player's pits become empty; player with most stones in reservoir wins
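The sowing mechanics above can be illustrated with a toy model. This is a simplified sketch (illustrative names, not the notebook's actual code) showing counterclockwise distribution, skipping opponents' reservoirs, and the extra-turn condition; the capture rule is omitted for brevity:

```python
# Toy four-player Mancala sowing sketch (simplified; capture rule omitted).
N_PLAYERS, N_PITS = 4, 3

def new_board(stones=2):
    return {"pits": [[stones] * N_PITS for _ in range(N_PLAYERS)],
            "res": [0] * N_PLAYERS}

def sow(board, player, pit):
    """Sow stones from `pit` counterclockwise; return True if the last
    stone landed in the player's own reservoir (extra turn)."""
    stones = board["pits"][player][pit]
    board["pits"][player][pit] = 0
    # Sowing order: each player's pits, then that player's reservoir.
    order = []
    for p in range(N_PLAYERS):
        order += [("pit", p, i) for i in range(N_PITS)]
        order.append(("res", p, None))
    pos = order.index(("pit", player, pit))
    last = None
    while stones:
        pos = (pos + 1) % len(order)
        kind, p, i = order[pos]
        if kind == "res" and p != player:
            continue  # skip opponents' reservoirs
        if kind == "res":
            board["res"][p] += 1
        else:
            board["pits"][p][i] += 1
        stones -= 1
        last = (kind, p, i)
    return last == ("res", player, None)
```

For example, with 2 stones in a 3-pit board, sowing from a player's second pit drops one stone in the third pit and one in the player's reservoir, earning an extra turn.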
| Feature | Traditional (2-Player) | Four-Player Variant |
|---|---|---|
| Capture Rule | Capture from all opponents | Only from directly opposite player |
| Game End | One player's side empty | Any player's side empty |
| Dynamics | Two-player zero-sum | Four-player positional strategy |
| Stone Distribution | Continuous around board | Skip opposing reservoirs |
Systematic exploration of board configurations:
- Pit counts: 3, 4, 6, 8 pits per player
- Stone counts: 2, 3, 4, 6 initial stones per pit
- Total configurations: 16 distinct environments tested
- Episodes per configuration: 3,000 training episodes
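The 16 environments are simply the cross product of the pit and stone counts; a minimal sketch of how such a sweep might be enumerated (names illustrative):

```python
from itertools import product

PIT_COUNTS = [3, 4, 6, 8]
STONE_COUNTS = [2, 3, 4, 6]

# 4 pit counts x 4 stone counts = 16 distinct board environments
configs = [{"pits": p, "stones": s, "label": f"{p}x{s}"}
           for p, s in product(PIT_COUNTS, STONE_COUNTS)]
```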
| Scenario | Agent Configuration | Purpose |
|---|---|---|
| Single agent | 1 RL agent vs 3 random players | Baseline performance establishment |
| Dual agents | 2 RL agents vs 2 random players | Cooperative/competitive dynamics analysis |
| Positional studies | Adjacent vs opposite agent placement | Strategic position advantage quantification |
- Win rate percentage (primary metric)
- Average reward accumulation
- Stone capture efficiency
- Extra turn exploitation rate
- Convergence speed analysis
The framework employs tabular Q-learning with experience replay for stable convergence:
```python
# Core hyperparameters
learning_rate = 0.1                # α - learning rate
discount_factor = 0.95             # γ - future reward discount
exploration_strategy = "ε-greedy"  # exploration policy

epsilon_decay = {
    "initial": 1.0,
    "minimum": 0.01,
    "decay_rate": "exponential"
}

experience_replay = {
    "buffer_size": 10000,         # memory capacity
    "batch_size": 32,             # training batch size
    "sampling": "uniform_random"  # replay sampling strategy
}
```

States are encoded as tuples capturing:
- Stone counts in all pits for all players
- Stone counts in all reservoirs
- Current player turn indicator
Valid actions correspond to selecting non-empty pits on the current player's side.
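One way such an encoding might look (a sketch, not the notebook's exact representation): flatten all pit counts, reservoir counts, and the turn indicator into a hashable tuple usable as a Q-table key, and read valid actions off the current player's row.

```python
def encode_state(pits, reservoirs, current_player):
    """Flatten the board into a hashable tuple (a Q-table key)."""
    flat = tuple(s for row in pits for s in row)
    return flat + tuple(reservoirs) + (current_player,)

def valid_actions(pits, current_player):
    """Indices of non-empty pits on the current player's side."""
    return [i for i, s in enumerate(pits[current_player]) if s > 0]
```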
| Event | Reward Value | Rationale |
|---|---|---|
| Stone to reservoir | +0.5 per stone | Encourages incremental progress |
| Stone capture | +2.0 per stone | Rewards strategic captures |
| Extra turn gained | +3.0 bonus | Incentivizes turn extension |
| Game win | Variable (rank-based) | Final outcome reinforcement |
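The shaping terms in the table above could be combined as follows (a sketch; the rank-based win bonus is a hypothetical placeholder, not the notebook's exact formula):

```python
def step_reward(stones_to_reservoir, stones_captured, extra_turn):
    """Per-step shaped reward mirroring the reward structure table."""
    return (0.5 * stones_to_reservoir   # incremental progress
            + 2.0 * stones_captured     # strategic captures
            + (3.0 if extra_turn else 0.0))  # turn-extension bonus

def terminal_reward(rank, n_players=4):
    """Hypothetical rank-based terminal bonus: rank 1 earns the most."""
    return float(n_players - rank)  # rank 1 -> 3.0, ..., rank 4 -> 0.0
```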
For complete algorithmic details and theoretical justification, refer to the methodology section in the accompanying PDF report.
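As a rough illustration of how tabular Q-learning and uniform-random experience replay fit together (all names illustrative, not the notebook's actual code):

```python
import random
from collections import defaultdict, deque

ALPHA, GAMMA = 0.1, 0.95
Q = defaultdict(float)        # (state, action) -> estimated value
buffer = deque(maxlen=10000)  # replay memory of transitions

def q_update(s, a, r, s2, actions2):
    """Standard tabular Q-learning update toward the bootstrapped target."""
    best_next = max((Q[(s2, a2)] for a2 in actions2), default=0.0)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def replay(batch_size=32):
    """Replay a uniform-random batch of stored transitions."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s2, actions2 in batch:
        q_update(s, a, r, s2, actions2)
```

Replaying stored transitions decorrelates updates and reuses sparse reward signals, which is why the report credits experience replay with stabilizing convergence.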
- Best single-agent win rate: 50.4% (6 pits, 2 stones configuration)
- Best dual-agent combined rate: 62% (adjacent positioning, 6 pits, 2 stones)
- Statistical significance: All major findings validated with p < 0.05
- Training convergence: Stable performance achieved within 3,000 episodes
- Theoretical baseline: 25% random agent win rate in four-player setting
- Non-monotonic complexity relationship: Intermediate complexity configurations (6 pits, 2 stones) create optimal learning conditions
- Positional advantages: Early turn order provides measurable strategic benefits (~5-10% win rate improvement)
- Emergent cooperation: Independently trained agents develop mutually beneficial strategies without explicit communication
- Configuration sensitivity: Small parameter changes create dramatic differences in learning outcomes
- Sparse reward challenges: Experience replay proves essential for stable convergence
All results include:
- 95% confidence intervals
- Binomial tests for win rate significance
- T-tests for score difference validation
- Multiple testing corrections where appropriate
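For instance, a single-agent win rate can be checked against the 25% random baseline with an exact one-sided binomial test and a normal-approximation 95% confidence interval. A stdlib-only sketch (scipy's `binomtest` provides the same test; game counts here are illustrative):

```python
import math

def binom_p_upper(k, n, p0=0.25):
    """Exact one-sided p-value: P(X >= k) under null win rate p0."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

def wald_ci_95(k, n):
    """Normal-approximation 95% confidence interval for a win rate."""
    p = k / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)
```

With, say, 504 wins in 1,000 evaluation games, the p-value against the 25% null is vanishingly small and the 95% CI is roughly (0.47, 0.53).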
For detailed statistical analysis and comprehensive results discussion, consult the research report PDF.
```
📁 UnderGraduateThesis/
├── 📓 MancalaFinal.ipynb  # PRIMARY: Complete self-contained implementation
├── 📄 Report.pdf          # Comprehensive methodology and analysis document
└── 📄 README.md           # This file
```
- Open MancalaFinal.ipynb in Jupyter or Google Colab
- Execute cells sequentially from top to bottom
- Monitor training progress through integrated progress bars
- View results in automatically generated plots and tables
The notebook is structured with clearly marked sections:
```python
# Example: Modify training parameters
TRAINING_EPISODES = 3000  # Reduce for faster testing
LEARNING_RATE = 0.1       # Adjust learning dynamics
BUFFER_SIZE = 10000       # Experience replay capacity

# Example: Test custom board configuration
custom_config = {
    'pits': 5,
    'stones': 3,
    'label': 'Custom_5x3'
}
```

The notebook supports three primary experimental workflows:
- Parameter Study: Systematic evaluation across all pit/stone combinations
- Positional Analysis: Agent placement effect quantification
- Dual Agent Study: Multi-agent interaction exploration
Each workflow is documented within the notebook. For theoretical background and design rationale, refer to the methodology section of the PDF report.
| Package | Version | Purpose |
|---|---|---|
| `numpy` | ≥1.19.0 | Numerical computing and array operations |
| `matplotlib` | ≥3.3.0 | Plotting and visualization |
| `gymnasium` | ≥0.26.0 | RL environment framework |
| `pandas` | ≥1.1.0 | Data manipulation and analysis |
| `scipy` | ≥1.5.0 | Statistical testing and analysis |
| `tqdm` | ≥4.50.0 | Progress bar visualization |
| `seaborn` | ≥0.11.0 | Statistical data visualization |
The MancalaFinal.ipynb notebook includes automatic dependency installation. For manual setup:
```bash
pip install numpy matplotlib gymnasium pandas scipy tqdm seaborn
```
- Python: 3.7, 3.8, 3.9, 3.10, 3.11
- Operating Systems: Windows, macOS, Linux
- Jupyter: Notebook 6.x or JupyterLab 3.x
- ⏱️ Training time: ~5-10 minutes per configuration (3,000 episodes on standard hardware)
- 💾 Memory usage: Sparse Q-table representation manages memory efficiently (~500MB peak)
- 📈 Scalability: Framework tested up to 8 pits, 6 stones (state space ~10^16)
- 🖥️ Hardware: CPU-only implementation (no GPU required)
- Reduce training episodes for preliminary testing (300-500 episodes)
- Use smaller board configurations (3-4 pits) for rapid prototyping
- Enable progress bars to monitor convergence
- Batch experiment execution for efficiency
- Fixed training duration: 3,000 episodes may be suboptimal for some configurations
- Random baseline evaluation: Limited comparison against sophisticated opponents
- Independent training: Agents trained separately, not simultaneously
- First-player advantage: Potential positional bias requires further investigation
- Tabular approach: State space explosion limits scalability to larger boards
- Deep Q-Networks (DQN): Neural network function approximation for larger state spaces
- Policy gradient methods: Actor-critic architectures for continuous improvement
- Monte Carlo Tree Search: Hybrid planning-learning approaches
- Simultaneous training: Co-evolutionary agent development
- Communication protocols: Explicit agent coordination mechanisms
- Competitive training: Agents trained against evolving opponents
- Human player analysis: Strategy comparison with expert human play
- Alternative algorithms: Comparative evaluation (SARSA, PPO, A3C)
- Transfer learning: Strategy generalization across board configurations
- Positional advantage quantification: Mathematical formalization of turn order effects
- Convergence guarantees: Theoretical bounds on learning performance
- Optimal strategy characterization: Game-theoretic analysis
For detailed discussion of limitations and future work, see the research report PDF.
All reported results undergo rigorous statistical validation:
- ✅ Confidence intervals: 95% confidence bounds for all performance metrics
- 📊 Hypothesis testing: Binomial tests for win rates, t-tests for score differences
- 📏 Effect sizes: Practical significance assessment beyond statistical significance
- 🔧 Multiple testing corrections: Bonferroni adjustments for family-wise error control
- 🎲 Reproducibility: Random seed control for experiment replication
- Statistical significance: p < 0.05
- Practical significance: Effect size > 0.3 (Cohen's d)
- Confidence level: 95% for all interval estimates
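The Cohen's d threshold above can be computed from two groups of scores with the pooled-standard-deviation variant (a stdlib-only sketch; the sample scores are illustrative):

```python
from statistics import mean, stdev

def cohens_d(xs, ys):
    """Cohen's d effect size using the pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    sx, sy = stdev(xs), stdev(ys)
    pooled = (((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)) ** 0.5
    return (mean(xs) - mean(ys)) / pooled
```

Values above 0.3 would clear the practical-significance bar stated above; conventionally, 0.2 is "small", 0.5 "medium", and 0.8 "large".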
Complete statistical procedures and validation protocols are documented in the PDF report.
If you use this framework in your research, please cite:
```bibtex
@techreport{sharan2025mancala,
  title={Exploring Multi-Agent Reinforcement Learning in a Four-Player Mancala Variant},
  author={Sharan, M.},
  institution={King's College London},
  year={2025},
  type={Final Project Report},
  note={Complete implementation available in MancalaFinal.ipynb}
}
```

This project is released under the MIT License. See the LICENSE file for full legal details.
- ✅ Academic and educational use
- ✅ Research and experimentation
- ✅ Extension and modification (with attribution)
- ❌ Commercial use without permission
- ❌ Reproduction without citation
- 🎓 King's College London Department of Informatics for academic support
- 🔧 OpenAI Gymnasium framework developers for the RL environment foundation
- 🎮 Gary MacLeod for the original four-player Mancala variant design
- 📚 Reinforcement Learning Community for algorithmic insights and best practices
For questions about the implementation, methodology, or research findings:
- Technical inquiries: Refer to code documentation in MancalaFinal.ipynb
- Methodological questions: Consult the comprehensive PDF report
- Academic correspondence: Contact through appropriate institutional channels
This work and all associated materials are protected under British and Indian copyright law. The content, methodology, code implementation, research findings, documentation, and the MancalaFinal.ipynb notebook are the intellectual property of the author(s).
Unauthorized copying, reproduction, distribution, or use of any portion of this work—including but not limited to the MancalaFinal.ipynb notebook, research report, code, methodology, or findings—without explicit written permission constitutes copyright infringement and plagiarism.
Any attempt to copy, reproduce, or derive works from this content may result in:
- ⚖️ Legal action for copyright infringement under the Copyright, Designs and Patents Act 1988 (UK)
- ⚖️ Legal action for copyright infringement under the Copyright Act, 1957 (India)
- 🎓 Academic misconduct proceedings for plagiarism at institutional and professional levels
- 💰 Claims for damages, legal costs, and injunctive relief
- 📋 Disciplinary action through academic integrity committees
All of the following are protected intellectual property:
- MancalaFinal.ipynb notebook and all code therein
- Research methodology and experimental design
- Algorithm implementations and optimizations
- Documentation and written content
- Generated results, visualizations, and analysis
- Project structure and organization
For licensing inquiries, permission requests, or collaboration proposals, please contact the author through official academic channels.
With proper citation:
- Educational reference and learning
- Academic study and research
- Critical analysis and review
All other uses require explicit written permission.
© 2026 - All Rights Reserved
Protected under the Copyright, Designs and Patents Act 1988 (UK) and the Copyright Act, 1957 (India)
This framework is designed for research and educational purposes. The implementation provides a robust foundation for exploring multi-agent reinforcement learning in traditional board games and can be extended for investigating other strategic multiplayer environments.
To begin your exploration:
- Open MancalaFinal.ipynb
- Review the PDF report for theoretical background
- Execute the notebook cells sequentially
- Experiment with custom configurations
Happy researching! 🎮🤖