Snake Agent Player - Genetic Algorithm Training

Itay Shaul, Lior Lotan, Ben Kapon, Ori Cohen

Table of Contents

  1. Introduction
  2. The Game
  3. Genetic Algorithm (GA)
  4. Code Overview
  5. Important Functions
  6. How to Run
  7. Framing the Problem
  8. Experiments
  9. Conclusion

Introduction

This project trains AI agents to play the classic Snake game using Genetic Algorithms (GA) combined with Neural Networks. The goal is to evolve neural network weights through a process inspired by natural selection, where the fittest agents (those that eat the most apples and survive longest) pass their "genes" (network weights) to the next generation.

The Snake game presents an interesting challenge for AI: the agent must learn to navigate toward food while avoiding collisions with walls, its own body, and (in advanced modes) obstacles. Unlike supervised learning where we provide correct answers, the genetic algorithm discovers effective strategies through evolutionary pressure alone.

Our project explores four key dimensions that affect learning performance:

  • Population size and generation count, and their impact on convergence speed
  • Environmental complexity (obstacle modes) and transfer learning
  • Fitness function design and the agent behaviors it produces
  • Training diversity and generalization

The Game

Classic Snake Mechanics

The Snake game is played on a grid where the player controls a snake that moves in one of four directions (UP, RIGHT, DOWN, LEFT). The objective is to eat apples that appear randomly on the grid. Each apple eaten causes the snake to grow longer. The game ends when the snake collides with:

  • The wall (boundary of the grid)
  • Its own body
  • An obstacle (in twist mode)

Game Modes

We implement two primary modes:

| Mode     | Description                                                              |
|----------|--------------------------------------------------------------------------|
| Baseline | Classic Snake without obstacles - the control condition for experiments  |
| Twist    | Snake with obstacles that add environmental complexity                   |

The twist mode supports three obstacle policies:

| Obstacle Mode | Description                                                       |
|---------------|-------------------------------------------------------------------|
| Static        | 3 fixed obstacles placed at the start of each episode             |
| Rotating      | A rolling window of 3 obstacles (oldest removed when new spawns)  |
| Aggregating   | Obstacles accumulate dynamically (1 per apple eaten)              |

Training Progress Visualization

The following GIFs demonstrate the evolution of agent performance across training generations:

Generation 1 - Initial Random Behavior (0 apples):

[GIF: Generation 1 gameplay]

The untrained agent moves randomly with no learned behavior, quickly colliding with a wall after just 12 steps.

Generation 50 - Intermediate Learning (10 apples):

[GIF: Generation 50 gameplay]

After 50 generations of evolution, the agent has learned basic food-seeking behavior. However, it still gets stuck in loops at times, circling without making progress toward the next apple.

Generation 250 - Final Trained Agent (141 apples - Victory!):

[GIF: Generation 250 gameplay]

The fully trained agent demonstrates sophisticated navigation, efficiently collecting apples while avoiding collisions. In this run, the agent achieved victory by eating all apples and filling the entire board!


Genetic Algorithm (GA)

Overview

The Genetic Algorithm is a method for solving optimization problems based on natural selection, the process that drives biological evolution. The core idea is that "the strong survive" - by maintaining a population of candidate solutions and iteratively selecting the best performers, we can evolve increasingly effective solutions over generations.

Algorithm Outline

flowchart TD
    A[Initialize Random Population] --> B[Evaluate Fitness]
    B --> C[Select Elite Individuals]
    C --> D[Tournament Selection]
    D --> E[Crossover - Combine Parents]
    E --> F[Mutation - Add Variation]
    F --> G{Max Generations?}
    G -->|No| B
    G -->|Yes| H[Return Best Genome]

Key Components

  1. Genome Representation: Each agent's neural network weights are encoded as a flat array (genome). The network has:

    • Input layer: 11 features (danger sensors, heading, food direction)
    • Hidden layer: Configurable size (default 16 neurons)
    • Output layer: 3 actions (turn left, go straight, turn right)
  2. Fitness Evaluation: Each genome is tested over multiple episodes. Fitness is computed based on:

    • Apples eaten (primary objective)
    • Steps survived (secondary objective)
    • Custom weighting allows different agent behaviors
  3. Selection: Tournament selection chooses parents - k random individuals compete, and the fittest wins the right to reproduce.

  4. Crossover: Uniform crossover combines two parent genomes - each gene is randomly taken from either parent.

  5. Mutation: Gaussian mutation adds random noise to genes with a configurable probability and magnitude.

  6. Elitism: The top-performing individuals are preserved unchanged to prevent losing good solutions.
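
The operators above are only a few lines each. A minimal sketch of how they could be combined into one generation of evolution (illustrative only, assuming genomes are NumPy arrays; names and defaults mirror the descriptions above, not necessarily the exact signatures in ga/operators.py):

import numpy as np

def tournament_select(population, fitnesses, k=5, rng=np.random):
    # k random individuals compete; the fittest wins the right to reproduce
    idx = rng.choice(len(population), size=k, replace=False)
    return population[idx[np.argmax(fitnesses[idx])]]

def uniform_crossover(parent_a, parent_b, rng=np.random):
    # each gene is taken from either parent with equal probability
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def gaussian_mutate(genome, rate=0.1, sigma=0.2, rng=np.random):
    # add Gaussian noise to a random subset of genes
    mask = rng.random(genome.shape) < rate
    return genome + mask * rng.normal(0.0, sigma, size=genome.shape)

def next_generation(population, fitnesses, elite_frac=0.05, rng=np.random):
    population = np.asarray(population)
    fitnesses = np.asarray(fitnesses)
    order = np.argsort(fitnesses)[::-1]
    n_elite = max(1, int(elite_frac * len(population)))
    new_pop = [population[i].copy() for i in order[:n_elite]]   # elitism: keep top performers unchanged
    while len(new_pop) < len(population):
        a = tournament_select(population, fitnesses, rng=rng)
        b = tournament_select(population, fitnesses, rng=rng)
        new_pop.append(gaussian_mutate(uniform_crossover(a, b, rng=rng), rng=rng))
    return np.stack(new_pop)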


Code Overview

Project Structure

SnakeAgentPlayer/
├── env/                        # Snake game environment
│   ├── snake_env.py            # Core Snake environment with O(1) collision detection
│   ├── obstacles.py            # Obstacle spawning policies (Static, Rotating, Aggregating)
│   └── rewards.py              # Reward shaping functions
│
├── agent/                      # Neural network agent
│   ├── encoder.py              # State encoder - converts game state to feature vector (11-dim)
│   ├── nn_policy.py            # Neural network policy backed by genome weights
│   └── policy.py               # Policy interface and baseline policies (Random, Greedy)
│
├── ga/                         # Genetic algorithm
│   ├── config.py               # GA hyperparameters configuration
│   ├── evaluator.py            # Fitness evaluators (Collector, Survivor, Hungry strategies)
│   ├── operators.py            # Genetic operators (selection, crossover, mutation)
│   └── ga.py                   # Main GA class orchestrating the evolution loop
│
├── rendering/                  # Visualization
│   └── visual_renderer.py      # Pygame-based renderer with modern graphics
│
├── scripts/                    # Training and testing utilities
│   ├── train.py                # Training entry point
│   ├── play.py                 # Agent playback tool
│   ├── automated_training.py   # Full training pipeline with checkpointing
│   ├── test_genome.py          # Genome evaluation and visual playback
│
├── experiments/                # Experiment utilities
│   ├── config_loader.py        # YAML configuration loading
│   ├── logging.py              # Training progress logging
│   └── plotting.py             # Results visualization
│
├── images/                     # Experiment result images and GIFs
├── setup.py                    # Package installation configuration
├── Makefile                    # Command shortcuts
└── requirements.txt            # Python dependencies

Important Functions

Genetic Algorithm Functions:

  • GeneticAlgorithm.run(): Execute the evolutionary loop and return the best genome found.
  • tournament_select(): Select a parent genome via tournament selection.
  • uniform_crossover(): Create offspring genomes by mixing two parents (uniform crossover).
  • gaussian_mutate(): Mutate a genome by adding Gaussian noise to individual genes.

Evaluation and Fitness Functions:

  • SnakeEvaluator.evaluate_genome(): Run a genome for multiple episodes and compute its fitness plus performance metrics.
  • BalancedFitnessEvaluator: Fitness variant that emphasizes survival (used for the “Survivor” strategy).
  • HungryFitnessEvaluator: Fitness variant that emphasizes step efficiency (used for the “Hungry” strategy).
  • TwistModeEvaluator: Twist-mode evaluator that averages performance across more episodes for obstacle environments.

Environment Functions:

  • SnakeEnv.reset(): Reset the environment to a new episode (optionally with a deterministic seed).
  • SnakeEnv.step(): Advance the game by one action and return (state, reward, done, info).
  • AggregatingObstaclePolicy.on_food_eaten(): Add new obstacles after each apple (obstacles accumulate).
  • RotatingObstaclePolicy.on_food_eaten(): Add new obstacles while keeping a maximum number (oldest rotates out).
  • StaticObstaclePolicy.on_reset(): Spawn a fixed set of obstacles at the start of an episode.
  • DefaultRewardFn.reward(): Default reward shaping (+apple reward, optional step reward, death penalty).
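
To make the environment API concrete, a single-episode rollout looks roughly like this (a sketch only; the constructor arguments, import path, and info keys are assumptions, while the reset/step signatures follow the descriptions above):

from env.snake_env import SnakeEnv   # import path assumed from the project structure

env = SnakeEnv(width=12, height=12)  # constructor arguments are an assumption
state = env.reset(seed=42)           # optional seed for a deterministic episode

done, steps = False, 0
while not done:
    action = 1                       # 0 = turn left, 1 = straight, 2 = turn right
    state, reward, done, info = env.step(action)
    steps += 1

print(steps, info)                   # a real agent would choose actions with NeuralNetPolicy.act()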

Agent and Policy Functions:

  • Encoder.encode(): Convert an EnvState into an 11-dimensional feature vector for the neural network.
  • NeuralNetPolicy.act(): Choose an action (left/straight/right) from encoded features using a feedforward network.
  • genome_size(): Compute the parameter count required for a given NetworkArch.
  • random_genome(): Initialize a random genome with scaled (Xavier-like) weights.
  • unpack_genome(): Unpack a flat genome into (W1, b1, W2, b2) tensors.
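
A sketch of what this genome-to-network mapping could look like for the 11 → 16 → 3 architecture described above (illustrative; the real unpack_genome and NeuralNetPolicy may differ in details such as the activation function):

import numpy as np

IN_DIM, HID_DIM, OUT_DIM = 11, 16, 3

def genome_size(in_dim=IN_DIM, hid=HID_DIM, out=OUT_DIM):
    return in_dim * hid + hid + hid * out + out          # W1 + b1 + W2 + b2

def unpack_genome(genome, in_dim=IN_DIM, hid=HID_DIM, out=OUT_DIM):
    i = 0
    W1 = genome[i:i + in_dim * hid].reshape(in_dim, hid); i += in_dim * hid
    b1 = genome[i:i + hid];                               i += hid
    W2 = genome[i:i + hid * out].reshape(hid, out);       i += hid * out
    b2 = genome[i:i + out]
    return W1, b1, W2, b2

def act(genome, features):
    # features: the 11-dim vector produced by Encoder.encode()
    W1, b1, W2, b2 = unpack_genome(genome)
    hidden = np.tanh(features @ W1 + b1)                  # tanh activation is an assumption
    logits = hidden @ W2 + b2
    return int(np.argmax(logits))                         # 0 = left, 1 = straight, 2 = right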

Rendering and Visualization Functions:

  • SnakeRenderer.render(): Render the current game state with pygame.
  • test_genome(): Load a genome + config and run evaluation episodes (optionally with visualization).

Training and CLI Entry Points:

  • AutomatedTrainingRunner.run(): End-to-end training pipeline with checkpointing, metrics, and best-genome saving.
  • scripts/train.py:main(): CLI wrapper that runs AutomatedTrainingRunner for a given YAML config.
  • scripts/play.py:main(): CLI wrapper that loads a saved genome and calls test_genome() for playback.

Data Files:

  • configs/*.yaml: Training configurations (environment + GA hyperparameters).
  • runs/<run_name>/*.npy: Saved genomes (checkpoints like gen_<N>.npy and best_genome.npy).
  • runs/<run_name>/metrics.csv: Per-generation training metrics.
  • runs/<run_name>/checkpoints_metadata.json: Metadata for saved checkpoints (fitness, apples, steps, death reasons).
  • runs/<run_name>/config_used.yaml: Exact configuration snapshot used for that run.

How to Run

Prerequisites

  • Python 3.11 or 3.12 (required)
  • Python 3.13+ may have compatibility issues with pygame on Windows

Installation

Mac/Linux:

git clone <repository-url>
cd SnakeAgentPlayer
python3 -m venv venv
source venv/bin/activate
make install

Windows:

git clone <repository-url>
cd SnakeAgentPlayer
python -m venv venv
venv\Scripts\activate
make install

After installation, run make help to see all available commands.

Training

make train CONFIG=configs/winner.yaml

The training will output:

  • runs/<experiment_name>/best_genome.npy - Trained agent weights
  • runs/<experiment_name>/config_used.yaml - Configuration used
  • runs/<experiment_name>/metrics.csv - Training statistics
  • runs/<experiment_name>/training_progress.png - Performance plot
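
metrics.csv is a regular CSV, so progress can also be inspected manually (a sketch; the exact column names are an assumption, check the file header first):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("runs/<experiment_name>/metrics.csv")
print(df.columns.tolist())                     # see which per-generation metrics are logged
df.plot(x="generation", y="best_fitness")      # assumed column names
plt.savefig("fitness_curve.png")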

Watching Trained Agents

# Watch the latest trained agent
make play

# Watch a specific experiment
make play RUN=winner

# Customize playback
make play RUN=winner EPISODES=10 FPS=20

Advanced CLI Options

| Command                  | Description                              |
|--------------------------|------------------------------------------|
| make help                | Show all available commands and options  |
| make install             | Install package and dependencies         |
| make train CONFIG=<file> | Train with specific config file          |
| make play [OPTIONS]      | Watch trained agent                      |
| make clean               | Remove training runs and cache           |

Play Options (all optional, smart defaults):

| Option         | Description                                                    |
|----------------|----------------------------------------------------------------|
| RUN=<name>     | Experiment name (default: latest run)                          |
| GENOME=<file>  | Genome to play (default: best_genome.npy from RUN)             |
| CONFIG=<file>  | Environment config (default: config_used.yaml from RUN)        |
| TWIST=<mode>   | Twist mode (overrides config if set): aggregating - obstacles accumulate over time; rotating - max 3 obstacles, oldest rotates out; static - 3 random fixed obstacles |
| EPISODES=<n>   | Number of episodes (default: 3)                                |
| FPS=<n>        | Playback speed (default: 10)                                   |
| VISUAL=0       | Disable visualization for faster testing (default: enabled)    |

Run make help to see usage examples.

Configuration Structure

env:
  width: 12                  # Grid width (cells)
  height: 12                 # Grid height (cells)
  twist: false               # Enable twist mode with obstacles (boolean)
  max_steps: 2000            # Optional: max steps per episode (timeout)
  starvation_steps: 200      # Optional: steps without food before starvation

  # Used only when twist: true (defaults shown below)
  obstacle_policy:
    mode: aggregating        # aggregating | rotating | static
    min_distance_from_head: 3
    obstacles_per_food: 1    # aggregating-only
    max_obstacles: 3         # rotating-only
    num_obstacles: 3         # static-only

agent:
  network:
    hidden_dim: 16           # Neural network hidden layer size

ga:
  pop_size: 50              # Population size
  generations: 200          # Number of generations
  episodes_per_genome: 5    # Episodes to evaluate each genome (averaged)
  seed: 123                 # Random seed (reproducibility)

  # Fitness function type (optional; defaults to "collector")
  # Supported: collector (aka greedy), survivor (aka balanced), hungry
  fitness_type: collector

  # GA operator hyperparameters (all optional; defaults shown below)
  elite_frac: 0.05
  tournament_k: 5
  crossover_rate: 0.7
  mutation_rate: 0.1
  mutation_sigma: 0.2

logging:
  output_dir: runs           # Where training runs are saved
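
Because configs are plain YAML, they can be loaded and inspected directly (config_loader.py provides the project's own loader; this is just a minimal stand-alone sketch):

import yaml

with open("configs/winner.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["ga"]["pop_size"], cfg["ga"]["generations"])
print(cfg["env"].get("twist", False))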

Framing the Problem

Our research investigates how different parameters and design choices affect the learning performance of genetic algorithms in the Snake game domain. We frame this as four distinct experimental questions:

  1. Population Size and Generation Count: What is the optimal combination of population size and number of generations for balancing exploration, convergence speed, and final performance?

  2. Environmental Complexity and Transfer Learning: How does training environment complexity (obstacle modes) affect robustness, seed sensitivity, and transfer learning across different configurations?

  3. Reward Shaping and Emergent Behavior: How do different fitness functions shape the behavior and performance of evolved Snake agents?

  4. Overfitting vs. Generalization: How does the diversity of training environments (number of training seeds) affect the agent's ability to generalize to unseen environments?

Each experiment is designed with:

  • Independent variable: The parameter being tested
  • Dependent variable: Performance metrics (apples eaten, survival time)
  • Controlled variables: All other parameters held constant

Experiments

Experiment 1: Population Size and Generation Count Impact

Research Question

What is the optimal balance between population size and generation count for maximizing learning speed and final performance?

We hypothesize that:

  • Larger populations will learn faster due to increased genetic diversity
  • More generations allow for continued improvement, but with diminishing returns
  • There exists an optimal combination where adding more generations or population provides no benefit
  • Smaller populations might need more generations to converge

Experiment Design

| Variable    | Values                                               |
|-------------|------------------------------------------------------|
| Independent | Population size: 20, 50, 100, 150                    |
| Independent | Generation count: 50, 150, 250, 325                  |
| Dependent   | Mean apples eaten by best agent                      |
| Controlled  | All other GA parameters (seed, mutation rate, etc.)  |

We test all combinations of population sizes and generation counts to find the optimal balance.

Total: 16 experiments

Note: Each configuration was tested across multiple random seeds, and results were averaged to ensure statistical reliability and reduce variance from lucky/unlucky initial conditions.
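
Conceptually the sweep is a nested loop over the two independent variables, averaged over seeds (a schematic sketch; run_training and the seed list are hypothetical stand-ins for the actual training pipeline):

import numpy as np

POP_SIZES = [20, 50, 100, 150]
GEN_COUNTS = [50, 150, 250, 325]
SEEDS = [1, 2, 3]                                  # hypothetical seed values

results = {}
for pop in POP_SIZES:
    for gens in GEN_COUNTS:
        # run_training(...) is a hypothetical helper returning the best agent's mean apples
        apples = [run_training(pop_size=pop, generations=gens, seed=s) for s in SEEDS]
        results[(pop, gens)] = np.mean(apples)     # 4 x 4 = 16 configurations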

Results

Population 20

[Figure: Population 20 comparison]

| Experiment   | Final Apples | Final Fitness |
|--------------|--------------|---------------|
| pop20_gen50  | 3.3          | 347           |
| pop20_gen150 | 26.7         | 2,745         |
| pop20_gen250 | 38.3         | 3,960         |
| pop20_gen325 | 56.0         | 5,754         |

Observation: Learning is slow. Still improving at 325 generations - no plateau reached.

Population 50

[Figure: Population 50 comparison]

| Experiment   | Final Apples | Final Fitness |
|--------------|--------------|---------------|
| pop50_gen50  | 1.7          | 175           |
| pop50_gen150 | 27.7         | 2,851         |
| pop50_gen250 | 56.7         | 5,830         |
| pop50_gen325 | 66.3         | 6,822         |

Observation: Performs worse than Pop20 at gen50, but catches up later. Still no plateau at 325 generations.

Population 100

[Figure: Population 100 comparison]

| Experiment    | Final Apples | Final Fitness |
|---------------|--------------|---------------|
| pop100_gen50  | 31           | 3,206         |
| pop100_gen150 | 74           | 7,665         |
| pop100_gen250 | 82           | 8,478         |
| pop100_gen325 | 82           | 8,478         |

Observation: Learns dramatically faster. Plateaus at ~82 apples around generation 250.

Population 150

[Figure: Population 150 comparison]

| Experiment    | Final Apples | Final Fitness |
|---------------|--------------|---------------|
| pop150_gen50  | 28.3         | 2,929         |
| pop150_gen150 | 29.7         | 3,041         |
| pop150_gen250 | 37.0         | 3,807         |
| pop150_gen325 | 39.3         | 4,042         |

Observation: SURPRISING! Pop150 has a big early spike at Gen 11 (~27 apples) but then improves very slowly. Even at 325 generations it only reaches 39 apples - way worse than Pop100's 82 apples!

Comparison Summary

| Generations | Pop20 | Pop50 | Pop100 | Pop150 | Winner |
|-------------|-------|-------|--------|--------|--------|
| 50          | 3.3   | 1.7   | 31     | 28.3   | Pop100 |
| 150         | 26.7  | 27.7  | 74     | 29.7   | Pop100 |
| 250         | 38.3  | 56.7  | 82     | 37.0   | Pop100 |
| 325         | 56.0  | 66.3  | 82     | 39.3   | Pop100 |

Conclusions

  1. Pop100 is the sweet spot for population: Neither too small (slow learning) nor too large (premature convergence).

  2. 250 generations is sufficient for Pop100: Performance plateaus at ~82 apples around generation 250 - additional generations (325) provide no improvement.

  3. Too large population can be bad: Pop150 improves very slowly after an early spike, reaching only 39 apples at 325 generations - way worse than Pop100's 82 apples. This demonstrates a well-known GA phenomenon called "loss of selection pressure" - with too many individuals, each genome has less relative impact, weakening the "survival of the fittest" mechanism. The population drifts rather than climbs toward optimal solutions.

  4. Small populations need more generations: Pop20 and Pop50 keep improving at 325 generations but never catch up to Pop100, suggesting they would need significantly more generations to converge.

  5. Diminishing returns on both axes: Beyond the optimal point, adding more population or more generations wastes computational resources without improving performance.

Bottom line: The optimal configuration is Population 100 with ~250 generations - this achieves peak performance (82 apples) with minimal computational cost.


Experiment 2: Transfer Learning Across Environmental Complexity

Research Question

Does training in more complex environments (with obstacles) help agents perform better when tested in simpler environments, and vice versa?

We trained agents in four different environment types - from classic Snake (no obstacles) to increasingly challenging obstacle modes. Then we tested each trained agent across ALL environment types to measure how well skills transfer between different conditions.

We hypothesize that:

  • Asymmetric Transfer: Agents trained with obstacles will adapt better to obstacle-free environments than the reverse (complex to simple transfers better than simple to complex).
  • Dynamic vs Static: Moving obstacles (Rotating, Aggregating) will produce more robust agents than fixed obstacles, since agents must learn general avoidance rather than memorizing specific positions.
  • Seed Sensitivity: Static obstacle positions may cause high variance between training runs, as some random seeds create easier/harder configurations.

Training Environments

| Mode             | Description                                                       |
|------------------|-------------------------------------------------------------------|
| Baseline Control | Classic Snake without obstacles - control condition               |
| Static           | 3 fixed obstacles placed at the start of each episode             |
| Rotating         | A rolling window of 3 obstacles (oldest removed when new spawns)  |
| Aggregating      | Obstacles accumulate dynamically (1 per apple eaten)              |

See The Game section for visualizations of each mode.

Experiment Design

| Variable    | Values                                                          |
|-------------|-----------------------------------------------------------------|
| Independent | Training environment: Baseline, Static, Rotating, Aggregating   |
| Dependent   | Mean apples eaten, transfer performance across environments     |
| Controlled  | Population size (100), generations (250), network architecture  |

Training Protocol:

For each of the 4 environment modes, we trained 12 separate agents using different random seeds. This ensures our results reflect the true capability of each training mode rather than lucky/unlucky seed effects. Results shown are averages across all 12 runs per mode.

| Parameter         | Value                                                  |
|-------------------|--------------------------------------------------------|
| Training Runs     | 12 independent seeds per environment mode (48 total)   |
| Population Size   | 100 genomes per generation                             |
| Training Duration | 250 generations per experiment                         |
| Evaluation        | 5 episodes per genome during training                  |
| Environment       | 12×12 grid                                             |
| Fitness Function  | Balanced: Fitness = (Apples × 50) + (Steps × 1.0)      |

Testing Protocol:

Each of the 48 trained agents was tested on ALL 4 environment modes (including modes different from training), creating a 4×4 transfer matrix. Total: 38,400 test episodes across 192 train/test combinations.
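
Schematically, the transfer matrix is built by testing every trained agent on every mode (a sketch; evaluate_agent is a hypothetical helper returning mean apples per episode for one trained agent on one test mode):

import numpy as np

MODES = ["baseline", "static", "rotating", "aggregating"]
N_SEEDS = 12                                   # trained agents per training mode

# transfer[i, j] = mean apples of agents trained on MODES[i], tested on MODES[j]
transfer = np.zeros((len(MODES), len(MODES)))
for i, train_mode in enumerate(MODES):
    for j, test_mode in enumerate(MODES):
        scores = [evaluate_agent(train_mode, seed, test_mode) for seed in range(N_SEEDS)]
        transfer[i, j] = np.mean(scores)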

Results

Aggregated Performance Matrix

Mean ± standard deviation across the 12 seeds per training mode; rows are the training environment, columns are the test environment (performance measured in apples per episode):

| Training Mode    | Baseline Control | Static    | Rotating  | Aggregating | Avg. Performance | Seed CV (%) |
|------------------|------------------|-----------|-----------|-------------|------------------|-------------|
| Baseline Control | 37.1 ± 7.4       | 4.4 ± 1.3 | 5.1 ± 0.9 | 4.4 ± 0.5   | 12.75            | 16.2%       |
| Static           | 21.5 ± 10.8      | 4.2 ± 1.9 | 4.9 ± 2.2 | 4.1 ± 1.8   | 8.67             | 46.1%       |
| Rotating         | 22.9 ± 4.9       | 5.0 ± 1.6 | 5.9 ± 1.2 | 4.7 ± 0.8   | 9.62             | 19.2%       |
| Aggregating      | 16.7 ± 7.4       | 4.5 ± 2.2 | 5.0 ± 1.6 | 4.2 ± 1.0   | 7.63             | 37.8%       |

CV (Coefficient of Variation) measures training stability across seeds. Lower is more reliable.
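
Concretely, the CV here is the spread of a training mode's per-seed scores relative to their mean (a minimal sketch, assuming it is computed over the 12 per-seed average performances):

import numpy as np

def coefficient_of_variation(per_seed_scores):
    scores = np.asarray(per_seed_scores, dtype=float)
    return 100.0 * scores.std() / scores.mean()    # percent; lower = more stable across seeds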

Death Reason Analysis

Percentage of death causes when agents play in their trained environment:

| Training Mode    | Body Collision | Wall Collision | Obstacle Collision | Starvation | Timeout |
|------------------|----------------|----------------|--------------------|------------|---------|
| Baseline Control | 73.8%          | 12.0%          | 0.0%               | 2.0%       | 12.2%   |
| Static           | 12.1%          | 9.5%           | 13.2%              | 65.2%      | 0.0%    |
| Rotating         | 14.8%          | 7.5%           | 14.9%              | 62.9%      | 0.0%    |
| Aggregating      | 16.4%          | 9.7%           | 16.5%              | 57.4%      | 0.0%    |

Baseline agents fail mostly by self-collision (greedy behavior), while obstacle-trained agents fail mostly by starvation (cautious behavior).

Transfer Learning Analysis

[Figure: Transfer learning]

  • Positive Transfer (Complexity → Simplicity): Agents trained on any obstacle mode retained 45–62% of baseline performance when tested on clean maps.
  • Negative Transfer (Simplicity → Complexity): Baseline agents suffered catastrophic failure, retaining only 12–14% of their performance when obstacles were introduced.

Key Observations

  1. Diagonal Dominance: Baseline Control dominates its own environment (37.1 apples) but fails catastrophically on any obstacle mode (~4.4 apples, 88% drop).

  2. Cross-Obstacle Transfer is Weak: All obstacle-trained agents perform similarly poorly (~4-6 apples) on modes different from their training.

  3. Static Mode Instability: A CV of 46.1% indicates extreme seed variance, with some seeds producing near-failure and others succeeding.

  4. Rotating Mode Reliability: Best stability among obstacle modes (19.2% CV) with consistent mid-tier generalization.

  5. Death Pattern Divergence: Baseline agents fail by self-collision (73.8%) indicating greedy behavior. Obstacle-trained agents fail by starvation (57-65%) indicating overly cautious navigation.

Conclusions

  1. Static Obstacles Create Deceptive Fitness Landscapes: Static mode's extreme instability (46.1% CV) occurs because fixed obstacle positions allow the GA to discover "lucky" paths that work for specific coordinates rather than learning obstacle avoidance as a skill.

  2. Dynamic Environments Force True Learning: Rotating obstacles achieved the most reliable training (19.2% CV) because the environment changes within each episode. Agents must develop functional understanding of spatial danger rather than memorizing paths.

  3. Asymmetric Transfer Reveals Strategy Differences: The 45–62% positive transfer (Obstacle → Baseline) versus 12–14% negative transfer (Baseline → Obstacle) demonstrates that obstacle training develops genuine spatial reasoning, while baseline training produces pure optimization shortcuts.

  4. Cross-Obstacle Transfer Weakness Indicates Mode-Specific Adaptation: Agents don't learn universal "avoid objects" behavior. They develop mode-specific heuristics: Static agents navigate fixed patterns, Rotating agents time movements between spawns, Aggregating agents learn escalating caution.


Experiment 3: Fitness Function Strategies

Research Question

How do different fitness functions shape the behavior and performance of evolved Snake agents?

Fitness Strategies:

| Strategy  | Formula                     | Goal                        |
|-----------|-----------------------------|-----------------------------|
| Collector | apples × 100 + steps × 0.1  | Maximize apple collection   |
| Survivor  | apples × 50 + steps × 1.0   | Balance survival and apples |
| Hungry    | apples × 100 - steps × 2.0  | Maximize eating efficiency  |
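
In code, the three strategies differ only in how apples and steps are weighted (a direct transcription of the formulas above; the corresponding evaluator classes live in ga/evaluator.py):

def collector_fitness(apples, steps):
    return apples * 100 + steps * 0.1      # apples dominate, tiny survival bonus

def survivor_fitness(apples, steps):
    return apples * 50 + steps * 1.0       # survival time matters almost as much as apples

def hungry_fitness(apples, steps):
    return apples * 100 - steps * 2.0      # wasted steps are punished, rewarding fast eating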

We hypothesize that:

  • High Apple Reward Strategy (Collector) will maximize total apples eaten while disregarding survival time and efficiency metrics
  • Balanced Reward Strategy (Survivor) will create cautious, long-surviving agents that balance apple collection with extended survival time
  • Time Penalty Strategy (Hungry) will create aggressive, efficiency-focused agents that minimize steps per apple

Experiment Design

| Variable    | Values                                                          |
|-------------|-----------------------------------------------------------------|
| Independent | Fitness function: Collector, Survivor, Hungry                   |
| Dependent   | Apples eaten, steps survived, efficiency (steps/apple)          |
| Controlled  | Population size (150), generations (200), network architecture  |

Training Protocol:

To ensure statistically robust results, we structured our experiment as follows:

  1. Three training seed groups: We selected 3 different random seeds (1264, 4242, 7777) to create different training conditions
  2. Three fitness types per seed: For each training seed, we trained 3 separate agents - one for each fitness strategy (Collector, Survivor, Hungry)
  3. Nine total configurations: This gave us 9 training runs total (3 seeds × 3 fitness types)

Evaluation Protocol:

After training, each of the 9 trained agents was evaluated on 200 unseen test episodes (using test seeds 10000-10199) to measure real-world performance. We then averaged results across the 3 runs for each fitness type to obtain the final performance metrics.

Results

Performance Comparison

[Figure: Robustness comparison]

Performance Summary:

| Strategy  | Apples Collected | Steps Survived  | Steps/Apple |
|-----------|------------------|-----------------|-------------|
| Collector | 47.96 ± 17.20    | 1643.8 ± 588.0  | 34.3        |
| Survivor  | 46.65 ± 30.80    | 1752.8 ± 1095.5 | 37.6        |
| Hungry    | 35.66 ± 11.35    | 921.3 ± 294.0   | 25.8        |

Conclusions

  1. Fitness Engineering Works: Each strategy successfully optimized for its target metric - Collector maximized apples (47.96), Survivor maximized survival time (1752.8 steps), Hungry maximized efficiency (25.8 steps/apple).

  2. No Free Lunch: Optimizing for one metric comes at a cost. Hungry's efficiency (25.8 steps/apple) sacrifices total performance (35.66 apples vs 47.96).

  3. Time Penalties Create Aggression AND Consistency: The harsh step penalty (×2.0) in the Hungry strategy created agents that eat fast (25.8 steps/apple) and die young (921 steps), yet are surprisingly the MOST consistent (±11.35 apples standard deviation).

  4. Surprising Result - Balanced Rewards ≠ Consistency: We hypothesized that balanced rewards would create stable agents, but Survivor shows the HIGHEST variability (±30.80 apples standard deviation). The harsh time penalty in Hungry actually produced the most consistent behavior.

  5. Multi-Objective Trade-offs: No strategy wins all metrics. Choose based on your objective.


Experiment 4: Impact of Training Diversity on Generalization

Research Question

How does the diversity of training environments (number of training seeds) affect the agent's ability to generalize to unseen environments?

Hypothesis

We hypothesize that:

  1. Low Diversity will result in Overfitting: The agent will achieve high scores on the training seed but will achieve lower scores on new seeds.
  2. High Diversity will result in Generalization: The agent will perform consistently well on both training and unseen test seeds.

Experiment Design

| Variable    | Values                                                         |
|-------------|----------------------------------------------------------------|
| Independent | Number of training seeds: 1, 5, 15                             |
| Dependent   | Train fitness, test fitness (generalization)                   |
| Controlled  | Population size (80), generations (120), network architecture  |

Procedure: We trained 3 separate agents for each category (1, 5, and 15 seeds). The results below represent the average performance across these runs.

  • Train Fitness: Evaluated on the specific seeds used during training.
  • Test Fitness: Evaluated on 100 new random seeds.
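
The gap between these two numbers is what the experiment measures; in sketch form (evaluate_on_seed is a hypothetical helper returning mean fitness for one seed, and the seed values are placeholders):

import numpy as np

train_seeds = [11, 12, 13, 14, 15]               # e.g. the 5-seed condition (placeholder values)
test_seeds = list(range(20000, 20100))           # 100 unseen seeds (placeholder values)

train_fitness = np.mean([evaluate_on_seed(best_genome, s) for s in train_seeds])
test_fitness = np.mean([evaluate_on_seed(best_genome, s) for s in test_seeds])
generalization_gap = train_fitness - test_fitness   # small gap = better generalization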

Results

[Figure: Impact of training diversity on generalization]

| Training Diversity | Train Fitness | Test Fitness |
|--------------------|---------------|--------------|
| 1 Seed             | 37.3          | 18.2         |
| 5 Seeds            | 29.5          | 23.5         |
| 15 Seeds           | 28.4          | 26.9         |

Conclusions

  1. Results Match Expectations: The data confirms our hypothesis. There is a clear trade-off between peak performance on known environments and stability on unseen environments.
  2. Overfitting: The agent trained on a single seed exhibited severe overfitting. It achieved high performance on its training environment (37.3) but failed to transfer this performance to new environments, resulting in significantly lower test fitness (18.2).
  3. Robustness and Convergence: Training on a larger set of seeds (15) encouraged the emergence of environment-agnostic behaviors rather than seed-specific strategies. Consequently, the generalization gap largely disappeared, with train and test performance converging (28.4 vs. 26.9), indicating a more stable and robust policy.

Conclusion

This project demonstrates that Genetic Algorithms can effectively evolve Neural Network controllers for Snake without labeled data, producing agents that learn food-seeking behavior and survival through selection pressure alone.

Across experiments, we show that training outcomes are strongly shaped by core design choices: compute budget (population size and generations), environment difficulty (obstacle modes), and whether training transfers to more complex settings.

Fitness engineering meaningfully changes behavior: reward structures can push agents toward maximizing apples, surviving longer, or improving efficiency, but the results highlight consistent trade-offs rather than a single universally best objective.

Finally, training diversity is critical for real-world reliability: single-seed training can overfit and fail to generalize, while increasing the number of training seeds produces more robust, environment-agnostic policies on unseen episodes.
