Itay Shaul, Lior Lotan, Ben Kapon, Ori Cohen
- Introduction
- The Game
- Genetic Algorithm (GA)
- Code Overview
- Important Functions
- How to Run
- Framing the Problem
- Experiments
- Conclusion
This project trains AI agents to play the classic Snake game using Genetic Algorithms (GA) combined with Neural Networks. The goal is to evolve neural network weights through a process inspired by natural selection, where the fittest agents (those that eat the most apples and survive longest) pass their "genes" (network weights) to the next generation.
The Snake game presents an interesting challenge for AI: the agent must learn to navigate toward food while avoiding collisions with walls, its own body, and (in advanced modes) obstacles. Unlike supervised learning where we provide correct answers, the genetic algorithm discovers effective strategies through evolutionary pressure alone.
Our project explores four key dimensions that affect learning performance:
- Population size and generation count, and their impact on convergence speed
- Environmental complexity (obstacle modes) and transfer learning
- Fitness function design and agent behaviors
- Training diversity and generalization
The Snake game is played on a grid where the player controls a snake that moves in one of four directions (UP, RIGHT, DOWN, LEFT). The objective is to eat apples that appear randomly on the grid. Each apple eaten causes the snake to grow longer. The game ends when the snake collides with:
- The wall (boundary of the grid)
- Its own body
- An obstacle (in twist mode)
We implement two primary modes:
| Mode | Description |
|---|---|
| Baseline | Classic Snake without obstacles - the control condition for experiments |
| Twist | Snake with obstacles that add environmental complexity |
The twist mode supports three obstacle policies: Static, Rotating, and Aggregating (detailed in the Code Overview and experiment sections below).
The following GIFs demonstrate the evolution of agent performance across training generations:
Generation 1 - Initial Random Behavior (0 apples):
The untrained agent moves randomly with no learned behavior, quickly colliding with a wall after just 12 steps.
Generation 50 - Intermediate Learning (10 apples):
After 50 generations of evolution, the agent has learned basic food-seeking behavior. However, it gets stuck in loops, circling without making progress toward food.
Generation 250 - Final Trained Agent (141 apples - Victory!):
The fully trained agent demonstrates sophisticated navigation, efficiently collecting apples while avoiding collisions. In this run, the agent achieved victory by eating all apples and filling the entire board!
The Genetic Algorithm is a method for solving optimization problems based on natural selection, the process that drives biological evolution. The core idea is that "the strong survive" - by maintaining a population of candidate solutions and iteratively selecting the best performers, we can evolve increasingly effective solutions over generations.
```mermaid
flowchart TD
    A[Initialize Random Population] --> B[Evaluate Fitness]
    B --> C[Select Elite Individuals]
    C --> D[Tournament Selection]
    D --> E[Crossover - Combine Parents]
    E --> F[Mutation - Add Variation]
    F --> G{Max Generations?}
    G -->|No| B
    G -->|Yes| H[Return Best Genome]
```
- Genome Representation: Each agent's neural network weights are encoded as a flat array (genome). The network has:
  - Input layer: 11 features (danger sensors, heading, food direction)
  - Hidden layer: Configurable size (default 16 neurons)
  - Output layer: 3 actions (turn left, go straight, turn right)
- Fitness Evaluation: Each genome is tested over multiple episodes. Fitness is computed based on:
  - Apples eaten (primary objective)
  - Steps survived (secondary objective)
  - Custom weighting allows different agent behaviors
- Selection: Tournament selection chooses parents - k random individuals compete, and the fittest wins the right to reproduce.
- Crossover: Uniform crossover combines two parent genomes - each gene is randomly taken from either parent.
- Mutation: Gaussian mutation adds random noise to genes with a configurable probability and magnitude.
- Elitism: The top-performing individuals are preserved unchanged to prevent losing good solutions.
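A minimal NumPy sketch of the loop and operators described above is shown below. It is an illustration, not the project's implementation: the fitness function is a placeholder, and the hyperparameter constants are assumed values standing in for the YAML configuration.

```python
import numpy as np

# Assumed hyperparameters; the real values come from the YAML config (ga section).
POP_SIZE, GENERATIONS, ELITE_COUNT = 50, 200, 2
TOURNAMENT_K, MUT_RATE, MUT_SIGMA = 5, 0.1, 0.2
GENOME_LEN = 11 * 16 + 16 + 16 * 3 + 3   # 243 weights/biases for the 11-16-3 network

rng = np.random.default_rng(123)

def evaluate(genome: np.ndarray) -> float:
    """Placeholder fitness; the project instead plays Snake episodes and scores them."""
    return -float(np.sum(genome ** 2))

def tournament_select(pop: list, fitness: np.ndarray) -> np.ndarray:
    """k random individuals compete; the fittest becomes a parent."""
    idx = rng.choice(len(pop), size=TOURNAMENT_K, replace=False)
    return pop[idx[np.argmax(fitness[idx])]]

def uniform_crossover(p1: np.ndarray, p2: np.ndarray) -> np.ndarray:
    """Each gene is copied from either parent with equal probability."""
    mask = rng.random(p1.shape) < 0.5
    return np.where(mask, p1, p2)

def gaussian_mutate(genome: np.ndarray) -> np.ndarray:
    """Add Gaussian noise to a random subset of genes."""
    hit = rng.random(genome.shape) < MUT_RATE
    return genome + hit * rng.normal(0.0, MUT_SIGMA, genome.shape)

population = [rng.normal(scale=0.5, size=GENOME_LEN) for _ in range(POP_SIZE)]
for generation in range(GENERATIONS):
    fitness = np.array([evaluate(g) for g in population])
    order = np.argsort(fitness)[::-1]                             # best first
    elites = [population[i].copy() for i in order[:ELITE_COUNT]]  # elitism
    children = []
    while len(children) < POP_SIZE - ELITE_COUNT:
        child = uniform_crossover(tournament_select(population, fitness),
                                  tournament_select(population, fitness))
        children.append(gaussian_mutate(child))
    population = elites + children

best_genome = max(population, key=evaluate)
```

In the project itself, the operators live in `ga/operators.py` and the loop is orchestrated by the `GeneticAlgorithm` class in `ga/ga.py`.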
```
SnakeAgentPlayer/
├── env/ # Snake game environment
│ ├── snake_env.py # Core Snake environment with O(1) collision detection
│ ├── obstacles.py # Obstacle spawning policies (Static, Rotating, Aggregating)
│ └── rewards.py # Reward shaping functions
│
├── agent/ # Neural network agent
│ ├── encoder.py # State encoder - converts game state to feature vector (11-dim)
│ ├── nn_policy.py # Neural network policy backed by genome weights
│ └── policy.py # Policy interface and baseline policies (Random, Greedy)
│
├── ga/ # Genetic algorithm
│ ├── config.py # GA hyperparameters configuration
│ ├── evaluator.py # Fitness evaluators (Collector, Survivor, Hungry strategies)
│ ├── operators.py # Genetic operators (selection, crossover, mutation)
│ └── ga.py # Main GA class orchestrating the evolution loop
│
├── rendering/ # Visualization
│ └── visual_renderer.py # Pygame-based renderer with modern graphics
│
├── scripts/ # Training and testing utilities
│ ├── train.py # Training entry point
│ ├── play.py # Agent playback tool
│ ├── automated_training.py # Full training pipeline with checkpointing
│ ├── test_genome.py # Genome evaluation and visual playback
│
├── experiments/ # Experiment utilities
│ ├── config_loader.py # YAML configuration loading
│ ├── logging.py # Training progress logging
│ └── plotting.py # Results visualization
│
├── images/ # Experiment result images and GIFs
├── setup.py # Package installation configuration
├── Makefile # Command shortcuts
└── requirements.txt # Python dependencies
```
Genetic Algorithm Functions:
- `GeneticAlgorithm.run()`: Execute the evolutionary loop and return the best genome found.
- `tournament_select()`: Select a parent genome via tournament selection.
- `uniform_crossover()`: Create offspring genomes by mixing two parents (uniform crossover).
- `gaussian_mutate()`: Mutate a genome by adding Gaussian noise to individual genes.
Evaluation and Fitness Functions:
- `SnakeEvaluator.evaluate_genome()`: Run a genome for multiple episodes and compute its fitness plus performance metrics.
- `BalancedFitnessEvaluator`: Fitness variant that emphasizes survival (used for the "Survivor" strategy).
- `HungryFitnessEvaluator`: Fitness variant that emphasizes step efficiency (used for the "Hungry" strategy).
- `TwistModeEvaluator`: Twist-mode evaluator that averages performance across more episodes for obstacle environments.
Environment Functions:
- `SnakeEnv.reset()`: Reset the environment to a new episode (optionally with a deterministic seed).
- `SnakeEnv.step()`: Advance the game by one action and return `(state, reward, done, info)`.
- `AggregatingObstaclePolicy.on_food_eaten()`: Add new obstacles after each apple (obstacles accumulate).
- `RotatingObstaclePolicy.on_food_eaten()`: Add new obstacles while keeping a maximum number (oldest rotates out).
- `StaticObstaclePolicy.on_reset()`: Spawn a fixed set of obstacles at the start of an episode.
- `DefaultRewardFn.reward()`: Default reward shaping (+apple reward, optional step reward, death penalty).
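To show how the environment is consumed, below is a hedged sketch of one evaluation episode using the `reset()`/`step()` interface above together with the agent's `encode()`/`act()` methods described in the next group. The `seed` keyword and the contents of `info` are assumptions; see `env/snake_env.py` for the real signatures.

```python
# Minimal episode loop; a sketch of the intended usage, not the project's exact API.
def run_episode(env, policy, encoder, seed=None):
    state = env.reset(seed=seed)              # new episode, optionally deterministic
    total_reward, steps, done, info = 0.0, 0, False, {}
    while not done:
        features = encoder.encode(state)      # game state -> 11-dim feature vector
        action = policy.act(features)         # turn left / go straight / turn right
        state, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return steps, total_reward, info          # info may include apples eaten and death reason
```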
Agent and Policy Functions:
- `Encoder.encode()`: Convert an `EnvState` into an 11-dimensional feature vector for the neural network.
- `NeuralNetPolicy.act()`: Choose an action (left/straight/right) from encoded features using a feedforward network.
- `genome_size()`: Compute the parameter count required for a given `NetworkArch`.
- `random_genome()`: Initialize a random genome with scaled (Xavier-like) weights.
- `unpack_genome()`: Unpack a flat genome into `(W1, b1, W2, b2)` tensors.
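For reference, here is a small sketch of how a flat genome could map onto the 11→16→3 network described earlier and produce an action. It illustrates the idea behind `genome_size()`, `unpack_genome()`, and `NeuralNetPolicy.act()`, not their actual code; the tanh/argmax choices and the weight layout are assumptions.

```python
import numpy as np

IN_DIM, HIDDEN_DIM, OUT_DIM = 11, 16, 3   # 11 features -> 16 hidden -> 3 actions

def genome_length() -> int:
    # Total parameters: W1 + b1 + W2 + b2
    return IN_DIM * HIDDEN_DIM + HIDDEN_DIM + HIDDEN_DIM * OUT_DIM + OUT_DIM

def unpack(genome: np.ndarray):
    """Slice a flat genome into (W1, b1, W2, b2)."""
    i = 0
    W1 = genome[i:i + IN_DIM * HIDDEN_DIM].reshape(IN_DIM, HIDDEN_DIM); i += IN_DIM * HIDDEN_DIM
    b1 = genome[i:i + HIDDEN_DIM]; i += HIDDEN_DIM
    W2 = genome[i:i + HIDDEN_DIM * OUT_DIM].reshape(HIDDEN_DIM, OUT_DIM); i += HIDDEN_DIM * OUT_DIM
    b2 = genome[i:i + OUT_DIM]
    return W1, b1, W2, b2

def act(genome: np.ndarray, features: np.ndarray) -> int:
    """Feedforward pass; returns 0 (left), 1 (straight), or 2 (right)."""
    W1, b1, W2, b2 = unpack(genome)
    hidden = np.tanh(features @ W1 + b1)
    return int(np.argmax(hidden @ W2 + b2))

genome = np.random.default_rng(0).normal(size=genome_length())
print(act(genome, np.zeros(IN_DIM)))   # action for an all-zero feature vector
```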
Rendering and Visualization Functions:
- `SnakeRenderer.render()`: Render the current game state with pygame.
- `test_genome()`: Load a genome + config and run evaluation episodes (optionally with visualization).
Training and CLI Entry Points:
- `AutomatedTrainingRunner.run()`: End-to-end training pipeline with checkpointing, metrics, and best-genome saving.
- `scripts/train.py:main()`: CLI wrapper that runs `AutomatedTrainingRunner` for a given YAML config.
- `scripts/play.py:main()`: CLI wrapper that loads a saved genome and calls `test_genome()` for playback.
Data Files:
- `configs/*.yaml`: Training configurations (environment + GA hyperparameters).
- `runs/<run_name>/*.npy`: Saved genomes (checkpoints like `gen_<N>.npy` and `best_genome.npy`).
- `runs/<run_name>/metrics.csv`: Per-generation training metrics.
- `runs/<run_name>/checkpoints_metadata.json`: Metadata for saved checkpoints (fitness, apples, steps, death reasons).
- `runs/<run_name>/config_used.yaml`: Exact configuration snapshot used for that run.
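As a quick way to inspect training outputs outside the provided scripts, the saved `.npy` genomes and `metrics.csv` can be read with standard tools; the run name below is a placeholder.

```python
import csv
import numpy as np

run = "runs/<run_name>"                       # placeholder: substitute a real run directory
genome = np.load(f"{run}/best_genome.npy")    # flat weight vector evolved by the GA
print("genome parameters:", genome.shape)

with open(f"{run}/metrics.csv") as f:         # per-generation training metrics
    rows = list(csv.DictReader(f))
print("generations logged:", len(rows))
```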
- Python 3.11 or 3.12 (required)
- Python 3.13+ may have compatibility issues with pygame on Windows
Mac/Linux:

```bash
git clone <repository-url>
cd SnakeAgentPlayer
python3 -m venv venv
source venv/bin/activate
make install
```

Windows:

```bash
git clone <repository-url>
cd SnakeAgentPlayer
python -m venv venv
venv\Scripts\activate
make install
```

After installation, run `make help` to see all available commands.
```bash
make train CONFIG=configs/winner.yaml
```

The training will output:

- `runs/<experiment_name>/best_genome.npy` - Trained agent weights
- `runs/<experiment_name>/config_used.yaml` - Configuration used
- `runs/<experiment_name>/metrics.csv` - Training statistics
- `runs/<experiment_name>/training_progress.png` - Performance plot
```bash
# Watch the latest trained agent
make play

# Watch a specific experiment
make play RUN=winner

# Customize playback
make play RUN=winner EPISODES=10 FPS=20
```

| Command | Description |
|---|---|
| `make help` | Show all available commands and options |
| `make install` | Install package and dependencies |
| `make train CONFIG=<file>` | Train with specific config file |
| `make play [OPTIONS]` | Watch trained agent |
| `make clean` | Remove training runs and cache |
Play Options (all optional, smart defaults):
| Option | Description |
|---|---|
| `RUN=<name>` | Experiment name (default: latest run) |
| `GENOME=<file>` | Genome to play (default: `best_genome.npy` from RUN) |
| `CONFIG=<file>` | Environment config (default: `config_used.yaml` from RUN) |
| `TWIST=<mode>` | Twist mode (overrides config if set) |
| | `aggregating` - Obstacles accumulate over time |
| | `rotating` - Max 3 obstacles, oldest rotates out |
| | `static` - 3 random fixed obstacles |
| `EPISODES=<n>` | Number of episodes (default: 3) |
| `FPS=<n>` | Playback speed (default: 10) |
| `VISUAL=0` | Disable visualization for faster testing (default: enabled) |

Run `make help` to see usage examples.
```yaml
env:
  width: 12                # Grid width (cells)
  height: 12               # Grid height (cells)
  twist: false             # Enable twist mode with obstacles (boolean)
  max_steps: 2000          # Optional: max steps per episode (timeout)
  starvation_steps: 200    # Optional: steps without food before starvation

  # Used only when twist: true (defaults shown below)
  obstacle_policy:
    mode: aggregating      # aggregating | rotating | static
    min_distance_from_head: 3
    obstacles_per_food: 1  # aggregating-only
    max_obstacles: 3       # rotating-only
    num_obstacles: 3       # static-only

agent:
  network:
    hidden_dim: 16         # Neural network hidden layer size

ga:
  pop_size: 50             # Population size
  generations: 200         # Number of generations
  episodes_per_genome: 5   # Episodes to evaluate each genome (averaged)
  seed: 123                # Random seed (reproducibility)

  # Fitness function type (optional; defaults to "collector")
  # Supported: collector (aka greedy), survivor (aka balanced), hungry
  fitness_type: collector

  # GA operator hyperparameters (all optional; defaults shown below)
  elite_frac: 0.05
  tournament_k: 5
  crossover_rate: 0.7
  mutation_rate: 0.1
  mutation_sigma: 0.2

logging:
  output_dir: runs         # Where training runs are saved
```

Our research investigates how different parameters and design choices affect the learning performance of genetic algorithms in the Snake game domain. We frame this as four distinct experimental questions:
- Population Size and Generation Count: What is the optimal combination of population size and number of generations for balancing exploration, convergence speed, and final performance?
- Environmental Complexity and Transfer Learning: How does training environment complexity (obstacle modes) affect robustness, seed sensitivity, and transfer learning across different configurations?
- Reward Shaping and Emergent Behavior: How do different fitness functions shape the behavior and performance of evolved Snake agents?
- Overfitting vs. Generalization: How does the diversity of training environments (number of training seeds) affect the agent's ability to generalize to unseen environments?
Each experiment is designed with:
- Independent variable: The parameter being tested
- Dependent variable: Performance metrics (apples eaten, survival time)
- Controlled variables: All other parameters held constant
What is the optimal balance between population size and generation count for maximizing learning speed and final performance?
We hypothesize that:
- Larger populations will learn faster due to increased genetic diversity
- More generations allow for continued improvement, but with diminishing returns
- There exists an optimal combination where adding more generations or population provides no benefit
- Smaller populations might need more generations to converge
| Variable | Values |
|---|---|
| Independent | Population size: 20, 50, 100, 150 |
| Independent | Generation count: 50, 150, 250, 325 |
| Dependent | Mean apples eaten by best agent |
| Controlled | All other GA parameters (seed, mutation rate, etc.) |
We test all combinations of population sizes and generation counts to find the optimal balance.
Total: 16 experiments
Note: Each configuration was tested across multiple random seeds, and results were averaged to ensure statistical reliability and reduce variance from lucky/unlucky initial conditions.
| Experiment | Final Apples | Final Fitness |
|---|---|---|
| pop20_gen50 | 3.3 | 347 |
| pop20_gen150 | 26.7 | 2,745 |
| pop20_gen250 | 38.3 | 3,960 |
| pop20_gen325 | 56.0 | 5,754 |
Observation: Learning is slow. Still improving at 325 generations - no plateau reached.
| Experiment | Final Apples | Final Fitness |
|---|---|---|
| pop50_gen50 | 1.7 | 175 |
| pop50_gen150 | 27.7 | 2,851 |
| pop50_gen250 | 56.7 | 5,830 |
| pop50_gen325 | 66.3 | 6,822 |
Observation: Performs worse than Pop20 at gen50, but catches up later. Still no plateau at 325 generations.
| Experiment | Final Apples | Final Fitness |
|---|---|---|
| pop100_gen50 | 31 | 3,206 |
| pop100_gen150 | 74 | 7,665 |
| pop100_gen250 | 82 | 8,478 |
| pop100_gen325 | 82 | 8,478 |
Observation: Learns dramatically faster. Plateaus at ~82 apples around generation 250.
| Experiment | Final Apples | Final Fitness |
|---|---|---|
| pop150_gen50 | 28.3 | 2,929 |
| pop150_gen150 | 29.7 | 3,041 |
| pop150_gen250 | 37.0 | 3,807 |
| pop150_gen325 | 39.3 | 4,042 |
Observation: Surprisingly, Pop150 shows a large early spike at generation 11 (~27 apples) but then improves very slowly. Even at 325 generations it reaches only 39 apples - far worse than Pop100's 82.
| Generations | Pop20 | Pop50 | Pop100 | Pop150 | Winner |
|---|---|---|---|---|---|
| 50 | 3.3 | 1.7 | 31 | 28.3 | Pop100 |
| 150 | 26.7 | 27.7 | 74 | 29.7 | Pop100 |
| 250 | 38.3 | 56.7 | 82 | 37.0 | Pop100 |
| 325 | 56.0 | 66.3 | 82 | 39.3 | Pop100 |
- Pop100 is the sweet spot for population size: neither too small (slow learning) nor too large (premature convergence).
- 250 generations is sufficient for Pop100: performance plateaus at ~82 apples around generation 250; additional generations (325) provide no improvement.
- Too large a population can hurt: Pop150 improves very slowly after an early spike, reaching only 39 apples at 325 generations - far worse than Pop100's 82. This demonstrates a well-known GA phenomenon, loss of selection pressure: with too many individuals, each genome has less relative impact, weakening the "survival of the fittest" mechanism, so the population drifts rather than climbs toward optimal solutions.
- Small populations need more generations: Pop20 and Pop50 keep improving at 325 generations but never catch up to Pop100, suggesting they would need significantly more generations to converge.
- Diminishing returns on both axes: beyond the optimal point, adding more population or more generations wastes computational resources without improving performance.
Bottom line: The optimal configuration is Population 100 with ~250 generations - this achieves peak performance (82 apples) with minimal computational cost.
Does training in more complex environments (with obstacles) help agents perform better when tested in simpler environments, and vice versa?
We trained agents in four different environment types - from classic Snake (no obstacles) to increasingly challenging obstacle modes. Then we tested each trained agent across ALL environment types to measure how well skills transfer between different conditions.
We hypothesize that:
- Asymmetric Transfer: Agents trained with obstacles will adapt better to obstacle-free environments than the reverse (complex to simple transfers better than simple to complex).
- Dynamic vs Static: Moving obstacles (Rotating, Aggregating) will produce more robust agents than fixed obstacles, since agents must learn general avoidance rather than memorizing specific positions.
- Seed Sensitivity: Static obstacle positions may cause high variance between training runs, as some random seeds create easier/harder configurations.
| Mode | Description |
|---|---|
| Baseline Control | Classic Snake without obstacles - control condition |
| Static | 3 fixed obstacles placed at the start of each episode |
| Rotating | A rolling window of 3 obstacles (oldest removed when new spawns) |
| Aggregating | Obstacles accumulate dynamically (1 per apple eaten) |
See The Game section for visualizations of each mode.
| Variable | Values |
|---|---|
| Independent | Training environment: Baseline, Static, Rotating, Aggregating |
| Dependent | Mean apples eaten, transfer performance across environments |
| Controlled | Population size (100), generations (250), network architecture |
Training Protocol:
For each of the 4 environment modes, we trained 12 separate agents using different random seeds. This ensures our results reflect the true capability of each training mode rather than lucky/unlucky seed effects. Results shown are averages across all 12 runs per mode.
| Parameter | Value |
|---|---|
| Training Runs | 12 independent seeds per environment mode (48 total) |
| Population Size | 100 genomes per generation |
| Training Duration | 250 generations per experiment |
| Evaluation | 5 episodes per genome during training |
| Environment | 12×12 grid |
| Fitness Function | Balanced: Fitness = (Apples × 50) + (Steps × 1.0) |
Testing Protocol:
Each of the 48 trained agents was tested on ALL 4 environment modes (including modes different from training), creating a 4×4 transfer matrix. Total: 38,400 test episodes across 192 train/test combinations.
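For clarity, here is a small sketch of how such a 4×4 transfer matrix can be assembled from per-agent evaluations. `evaluate_agent` is a hypothetical placeholder for running the 200 test episodes; the real pipeline uses the evaluators in `ga/evaluator.py`.

```python
import numpy as np

MODES = ["baseline", "static", "rotating", "aggregating"]
rng = np.random.default_rng(0)

def evaluate_agent(agent, test_mode: str, episodes: int = 200) -> float:
    """Placeholder for the real evaluation: play `episodes` games in `test_mode`
    with the trained agent and return the mean apples eaten."""
    return float(rng.uniform(0, 40))   # dummy value so the sketch runs end to end

def transfer_matrix(agents_by_mode: dict) -> np.ndarray:
    """agents_by_mode maps each training mode to its 12 trained agents (one per seed)."""
    matrix = np.zeros((len(MODES), len(MODES)))
    for i, train_mode in enumerate(MODES):
        for j, test_mode in enumerate(MODES):
            scores = [evaluate_agent(a, test_mode) for a in agents_by_mode[train_mode]]
            matrix[i, j] = float(np.mean(scores))   # mean apples across the 12 seeds
    return matrix

dummy_agents = {mode: [None] * 12 for mode in MODES}   # stand-ins for trained genomes
print(np.round(transfer_matrix(dummy_agents), 1))
```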
Mean ± Standard Deviation across 12 experiments (Performance measured in apples per episode):
| Training Mode | Baseline Control | Static | Rotating | Aggregating | Avg. Performance | Seed CV (%) |
|---|---|---|---|---|---|---|
| Baseline Control | 37.1 ±7.4 | 4.4 ±1.3 | 5.1 ±0.9 | 4.4 ±0.5 | 12.75 | 16.2% |
| Static | 21.5 ±10.8 | 4.2 ±1.9 | 4.9 ±2.2 | 4.1 ±1.8 | 8.67 | 46.1% |
| Rotating | 22.9 ±4.9 | 5.0 ±1.6 | 5.9 ±1.2 | 4.7 ±0.8 | 9.62 | 19.2% |
| Aggregating | 16.7 ±7.4 | 4.5 ±2.2 | 5.0 ±1.6 | 4.2 ±1.0 | 7.63 | 37.8% |
CV (Coefficient of Variation) measures training stability across seeds. Lower is more reliable.
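For reference, the CV for a training mode is the standard deviation of its final scores across the 12 training seeds divided by their mean, expressed as a percentage:

$$\mathrm{CV} = \frac{\sigma_{\text{seeds}}}{\mu_{\text{seeds}}} \times 100\%$$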
Percentage of death causes when agents play in their trained environment:
| Training Mode | Body Collision | Wall Collision | Obstacle Collision | Starvation | Timeout |
|---|---|---|---|---|---|
| Baseline Control | 73.8% | 12.0% | 0.0% | 2.0% | 12.2% |
| Static | 12.1% | 9.5% | 13.2% | 65.2% | 0.0% |
| Rotating | 14.8% | 7.5% | 14.9% | 62.9% | 0.0% |
| Aggregating | 16.4% | 9.7% | 16.5% | 57.4% | 0.0% |
Baseline agents fail by self-collision (greedy behavior), while obstacle-trained agents fail by starvation (cautious behavior).
- Positive Transfer (Complexity → Simplicity): Agents trained on any obstacle mode retained 45–62% of baseline performance when tested on clean maps.
- Negative Transfer (Simplicity → Complexity): Baseline agents suffered catastrophic failure, retaining only 12–14% of their performance when obstacles were introduced.
- Diagonal Dominance: Baseline Control dominates its own environment (37.1 apples) but fails catastrophically on any obstacle mode (~4.4 apples, an 88% drop).
- Cross-Obstacle Transfer is Weak: All obstacle-trained agents perform similarly poorly (~4-6 apples) on modes different from their training.
- Static Mode Instability: CV of 46.1% indicates extreme seed variance - some seeds produce near-failure, others succeed.
- Rotating Mode Reliability: Best stability among obstacle modes (19.2% CV) with consistent mid-tier generalization.
- Death Pattern Divergence: Baseline agents fail by self-collision (73.8%), indicating greedy behavior. Obstacle-trained agents fail by starvation (57-65%), indicating overly cautious navigation.
- Static Obstacles Create Deceptive Fitness Landscapes: Static mode's extreme instability (46.1% CV) occurs because fixed obstacle positions allow the GA to discover "lucky" paths that work for specific coordinates rather than learning obstacle avoidance as a skill.
- Dynamic Environments Force True Learning: Rotating obstacles achieved the most reliable training (19.2% CV) because the environment changes within each episode. Agents must develop a functional understanding of spatial danger rather than memorizing paths.
- Asymmetric Transfer Reveals Strategy Differences: The 45–62% positive transfer (Obstacle → Baseline) versus 12–14% negative transfer (Baseline → Obstacle) demonstrates that obstacle training develops genuine spatial reasoning, while baseline training produces pure optimization shortcuts.
- Cross-Obstacle Transfer Weakness Indicates Mode-Specific Adaptation: Agents don't learn a universal "avoid objects" behavior. They develop mode-specific heuristics: Static agents navigate fixed patterns, Rotating agents time movements between spawns, Aggregating agents learn escalating caution.
How do different fitness functions shape the behavior and performance of evolved Snake agents?
Fitness Strategies:
| Strategy | Formula | Goal |
|---|---|---|
| Collector | `apples × 100 + steps × 0.1` | Maximize apple collection |
| Survivor | `apples × 50 + steps × 1.0` | Balance survival and apples |
| Hungry | `apples × 100 - steps × 2.0` | Maximize eating efficiency |
We hypothesize that:
- High Apple Reward Strategy (Collector) will maximize total apples eaten while disregarding survival time and efficiency metrics
- Balanced Reward Strategy (Survivor) will create cautious, long-surviving agents that balance apple collection with extended survival time
- Time Penalty Strategy (Hungry) will create aggressive, efficiency-focused agents that minimize steps per apple
| Variable | Values |
|---|---|
| Independent | Fitness function: Collector, Survivor, Hungry |
| Dependent | Apples eaten, steps survived, efficiency (steps/apple) |
| Controlled | Population size (150), generations (200), network architecture |
Training Protocol:
To ensure statistically robust results, we structured our experiment as follows:
- Three training seed groups: We selected 3 different random seeds (1264, 4242, 7777) to create different training conditions
- Three fitness types per seed: For each training seed, we trained 3 separate agents - one for each fitness strategy (Collector, Survivor, Hungry)
- Nine total configurations: This gave us 9 training runs total (3 seeds × 3 fitness types)
Evaluation Protocol:
After training, each of the 9 trained agents was evaluated on 200 unseen test episodes (using test seeds 10000-10199) to measure real-world performance. We then averaged results across the 3 runs for each fitness type to obtain the final performance metrics.
Performance Summary:
| Strategy | Apples Collected | Steps Survived | Steps/Apple |
|---|---|---|---|
| Collector | 47.96 ± 17.20 | 1643.8 ± 588.0 | 34.3 |
| Survivor | 46.65 ± 30.80 | 1752.8 ± 1095.5 | 37.6 |
| Hungry | 35.66 ± 11.35 | 921.3 ± 294.0 | 25.8 |
- Fitness Engineering Works: Each strategy successfully optimized for its target metric - Collector maximized apples (47.96), Survivor maximized survival time (1752.8 steps), Hungry maximized efficiency (25.8 steps/apple).
- No Free Lunch: Optimizing for one metric comes at a cost. Hungry's efficiency (25.8 steps/apple) sacrifices total performance (35.66 apples vs. 47.96).
- Time Penalties Create Aggression and Consistency: The harsh step penalty (×2.0) in the Hungry strategy created agents that eat fast (25.8 steps/apple) and die young (921 steps), yet are surprisingly the most consistent (±11.35 apples).
- Surprising Result - Balanced Rewards ≠ Consistency: We hypothesized that balanced rewards would create stable agents, but Survivor shows the highest game-to-game variability (±30.80 apples); the harsh time penalty in Hungry actually produced the most consistent behavior.
- Multi-Objective Trade-offs: No strategy wins on all metrics - choose based on your objective.
How does the diversity of training environments (number of training seeds) affect the agent's ability to generalize to unseen environments?
We hypothesize that:
- Low Diversity will result in Overfitting: The agent will achieve high scores on the training seed but will achieve lower scores on new seeds.
- High Diversity will result in Generalization: The agent will perform consistently well on both training and unseen test seeds.
| Variable | Values |
|---|---|
| Independent | Number of training seeds: 1, 5, 15 |
| Dependent | Train fitness, test fitness (generalization) |
| Controlled | Population size (80), generations (120), network architecture |
Procedure: We trained 3 separate agents for each category (1, 5, and 15 seeds). The results below represent the average performance across these runs.
- Train Fitness: Evaluated on the specific seeds used during training.
- Test Fitness: Evaluated on 100 new random seeds.
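A hedged sketch of how these two metrics can be computed is shown below; `play_episode` is a hypothetical stand-in for running one seeded Snake episode with the trained agent and returning its fitness, and the example seed values are purely illustrative.

```python
import numpy as np

def play_episode(agent, seed: int) -> float:
    """Hypothetical stand-in: run one Snake episode on `seed` and return the fitness."""
    return float(np.random.default_rng(seed).uniform(0, 40))   # dummy value

def train_test_fitness(agent, train_seeds, test_seeds):
    """Average fitness on the training seeds vs. on unseen test seeds."""
    train = float(np.mean([play_episode(agent, s) for s in train_seeds]))
    test = float(np.mean([play_episode(agent, s) for s in test_seeds]))
    return train, test   # a large train-test gap indicates overfitting

# Example: a single training seed vs. 100 seeds never used during training
print(train_test_fitness(agent=None, train_seeds=[42], test_seeds=range(5000, 5100)))
```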
| Training Diversity | Train Fitness | Test Fitness |
|---|---|---|
| 1 Seed | 37.3 | 18.2 |
| 5 Seeds | 29.5 | 23.5 |
| 15 Seeds | 28.4 | 26.9 |
- Results Match Expectations: The data confirms our hypothesis. There is a clear trade-off between peak performance on known environments and stability on unseen environments.
- Overfitting: The agent trained on a single seed exhibited severe overfitting. It achieved high performance on its training environment (37.3) but failed to transfer this performance to new environments, resulting in significantly lower test fitness (18.2).
- Robustness and Convergence: Training on a larger set of seeds (15) encouraged the emergence of environment-agnostic behaviors rather than seed-specific strategies. Consequently, the generalization gap largely disappeared, with train and test performance converging (28.4 vs. 26.9), indicating a more stable and robust policy.
This project demonstrates that Genetic Algorithms can effectively evolve Neural Network controllers for Snake without labeled data, producing agents that learn food-seeking behavior and survival through selection pressure alone.
Across experiments, we show that training outcomes are strongly shaped by core design choices: compute budget (population size and generations), environment difficulty (obstacle modes), and whether training transfers to more complex settings.
Fitness engineering meaningfully changes behavior: reward structures can push agents toward maximizing apples, surviving longer, or improving efficiency, but the results highlight consistent trade-offs rather than a single universally best objective.
Finally, training diversity is critical for real-world reliability: single-seed training can overfit and fail to generalize, while increasing the number of training seeds produces more robust, environment-agnostic policies on unseen episodes.













