kanta8864/Tennis-Rally-Language-Model

Tennis Rally Language Model

A GPT-style transformer that treats tennis rallies as a language: it is trained to predict the next shot from the sequence of shots before it. The hidden representations it learns are then reused for downstream analysis, such as clustering players by playing pattern.


Motivation

Tennis rallies follow structured patterns — shot type, direction, court position — that can be encoded as discrete tokens, just like words in a sentence. This project applies causal language modeling to tennis rally data to ask: can a model learn the "grammar" of tennis? The learned embeddings are then used to cluster players by playing style without any manually defined features.


Dataset

Source: Tennis Match Charting Project — a crowd-sourced, open dataset of shot-by-shot match charting.

Three CSV files cover different eras:

| File | Period | Size |
| --- | --- | --- |
| charting-m-points-to-2009.csv | Pre-2010 | ~1.1 MB |
| charting-m-points-2010s.csv | 2010–2019 | ~36 MB |
| charting-m-points-2020s.csv | 2020–present | ~53 MB |

Scale: 1,221,862 points across 7,194 matches spanning from the 1960s through the 2020s.

Each row represents one point and contains a rally encoded in Match Charting notation, a compact string in which each character or short group of characters encodes a shot event. For example, 4f39f3f3d@ reads as: serve wide (4) → forehand return hit deep toward the backhand side (f39) → forehand toward the backhand side (f3) → a second forehand in the same direction (f3) that lands long (d) for an unforced error (@).

Preprocessing (data/dataset.py):

  • Uses the second-serve column when the first serve was a fault, otherwise the first-serve column
  • Filters out points with no recognized point ending (* winner, @ unforced error, # forced error)
  • Drops sequences shorter than 3 tokens
  • 95/5 train/validation split
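
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in data/dataset.py: the column names `1st` and `2nd`, the requirement that the ending symbol be the last character, the shuffle before splitting, and the `tokenize` callable are all assumptions made for the example.

```python
import random

POINT_ENDINGS = ("*", "@", "#")  # winner, unforced error, forced error


def select_rally(row):
    """Return the second-serve string if one was charted, else the first-serve string."""
    second = (row.get("2nd") or "").strip()  # "2nd" / "1st" are assumed column names
    return second if second else (row.get("1st") or "").strip()


def preprocess(rows, tokenize, min_tokens=3, val_fraction=0.05, seed=0):
    """Filter rallies, then split them into (train, validation) lists."""
    rallies = []
    for row in rows:
        rally = select_rally(row)
        # Assumption: a recognized ending is the final character of the string.
        if not rally.endswith(POINT_ENDINGS):
            continue
        if len(tokenize(rally)) < min_tokens:
            continue
        rallies.append(rally)
    random.Random(seed).shuffle(rallies)
    n_val = int(len(rallies) * val_fraction)
    return rallies[n_val:], rallies[:n_val]
```

Here `tokenize` stands in for whatever tokenizer is used; passing `list` treats every character as a token, which is enough to try the filter on a handful of rows.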

Tokenizer

A custom rule-based tokenizer (data/tokenizer.py) converts raw rally strings into integer token IDs. Vocabulary size is ~900 tokens.

Token categories:

| Category | Examples | Meaning |
| --- | --- | --- |
| Serve direction | 4, 5, 6 | Wide, body, T |
| Shot type | f, b, v, o, u | Forehand, backhand, volley, smash, drop shot |
| Direction | 1, 2, 3 | Forehand side, middle, backhand side |
| Position | +, -, = | Approach, net, baseline |
| Return depth | 7, 8, 9 | Shallow, moderate, deep |
| Point ending | *, @, # | Winner, unforced error, forced error |
| Error type | n, w, d | Net, wide, deep |

The tokenizer parses left-to-right using a greedy longest-match strategy — e.g., f39 is tokenized as a single return token (forehand, backhand side, deep) rather than three individual characters.
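
A greedy longest-match pass can be sketched as below. The vocabulary here is a tiny hypothetical one chosen only to make the example run, and the `<unk>` fallback is an assumption; the real vocabulary and rules live in data/tokenizer.py and may group characters differently.

```python
def greedy_tokenize(rally, vocab):
    """Split a rally string left to right, always taking the longest vocabulary match."""
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(rally):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(rally) - i), 0, -1):
            piece = rally[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matched at this position: emit a placeholder (assumed behaviour).
            tokens.append("<unk>")
            i += 1
    return tokens


# Hypothetical toy vocabulary, not the project's ~900-token vocabulary.
TOY_VOCAB = {"4", "5", "6", "f", "b", "f1", "f3", "f39", "b2", "d", "*", "@", "#"}

print(greedy_tokenize("4f39f3f3d@", TOY_VOCAB))
```

With this toy vocabulary, `f39` is consumed as one token because the three-character match is tried before `f3` or `f`.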

Each sequence is wrapped with <bos> / <eos> tokens, then truncated or padded to a fixed length of 64.
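
As a rough sketch of that encoding step: the special-token names `<pad>` and `<unk>`, the ID assignments, and the choice to truncate from the right (which can cut off `<eos>` on very long rallies) are assumptions for illustration, not details confirmed by the repository.

```python
def encode(tokens, token_to_id, max_len=64):
    """Map tokens to IDs, add <bos>/<eos>, then truncate or pad to max_len."""
    unk = token_to_id["<unk>"]
    ids = [token_to_id["<bos>"]]
    ids += [token_to_id.get(token, unk) for token in tokens]
    ids.append(token_to_id["<eos>"])
    ids = ids[:max_len]                                   # truncate long rallies
    ids += [token_to_id["<pad>"]] * (max_len - len(ids))  # pad short ones
    return ids


# Hypothetical ID assignment for the example only.
token_to_id = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3, "4": 4, "f39": 5}
print(encode(["4", "f39"], token_to_id, max_len=6))
```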


Model Architecture

A GPT-2 style decoder-only transformer (models/rally_transformer.py) with causal self-attention.

Vocab size:       900
Max sequence len: 64
Hidden dim:       128
Layers:           4
Attention heads:  4
FFN dim:          512
Dropout:          0.1
Parameters:       ~2.3M

Key design choices:

  • Causal masking: Each position attends only to itself and earlier positions, enabling next-token prediction at every step
  • Pre-LayerNorm: Normalizes inputs before each sublayer for better training stability
  • Weight tying: The output projection shares weights with the token embedding matrix, reducing parameters while keeping input/output semantics aligned
  • GPT-2 initialization: Weights drawn from a normal distribution with mean 0 and standard deviation 0.02; residual projections scaled by 1/√(2 × n_layers)
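
Two of these choices are easy to illustrate without a deep-learning framework. The sketch below uses plain Python lists instead of the PyTorch tensors the project uses, so it shows the idea rather than the code in models/rally_transformer.py; the function names are hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row of a square score matrix over positions 0..i only."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # later positions are masked out
        peak = max(visible)                   # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


def residual_init_std(n_layers, base_std=0.02):
    """Standard deviation for residual projections: base_std / sqrt(2 * n_layers)."""
    return base_std / math.sqrt(2 * n_layers)


print(causal_attention_weights([[0.0] * 3 for _ in range(3)]))
print(residual_init_std(4))
```

With uniform scores, the first position puts all of its weight on itself and each later position spreads weight evenly over the positions it is allowed to see. For the 4-layer configuration, the scaled standard deviation works out to 0.02/√8, roughly 0.0071.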

Training

Entry point: train.py
Trainer: training/trainer.py

| Hyperparameter | Value |
| --- | --- |
| Batch size | 256 |
| Max epochs | 20 |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine decay with linear warmup (200 steps) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight decay | 0.1 (weights only; biases/LayerNorm excluded) |
| Gradient clipping | 1.0 |
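
A cosine schedule with linear warmup can be written as a small function of the step index. This is a sketch under assumptions: the decay floor `min_lr=0.0` and the exact warmup formula are not stated in this README, and `total_steps` is set to the reported final step of the full run.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=200, total_steps=76_980, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 199, 38_590, 76_980):
    print(step, learning_rate(step))
```

The rate climbs linearly for the first 200 steps, reaches the peak of 3e-4, falls to half the peak at the midpoint of the decay phase, and ends at the assumed floor.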

Loss: cross-entropy over next-token prediction, ignoring padding positions.
Validation runs every 500 steps; checkpoints saved every 1000 steps with best-model tracking.
Device selection is automatic: CUDA → MPS → CPU.
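
The padding-aware loss mentioned above can be illustrated in plain Python. This is a sketch of the idea only, with `pad_id=0` as an assumed padding ID; in PyTorch the same effect is commonly obtained through the `ignore_index` argument of `torch.nn.functional.cross_entropy`, though whether training/trainer.py does it that way is not stated here.

```python
import math


def masked_cross_entropy(logits, targets, pad_id=0):
    """Mean next-token cross-entropy over positions whose target is not padding."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_id:
            continue                          # padding contributes nothing to the loss
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]       # negative log-probability of the target
        count += 1
    return total / count if count else 0.0


# Two positions over a 4-token vocabulary; the second target is padding.
print(masked_cross_entropy([[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]], [1, 0]))
```

Only the first position counts in this example, so the result is the loss of a uniform prediction over four tokens, ln 4.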

The model was trained to step 76,980 (full 20 epochs), producing 76 checkpoints.
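
The reported figures are mutually consistent, which a few lines of arithmetic confirm. The last quantity is only an estimate: it assumes every step used a full batch of 256, which slightly overstates the count if the final batch of each epoch was partial.

```python
total_steps = 76_980
epochs = 20
batch_size = 256
checkpoint_every = 1_000

steps_per_epoch = total_steps // epochs                 # optimizer steps in one epoch
checkpoints = total_steps // checkpoint_every           # periodic checkpoints written
approx_train_sequences = steps_per_epoch * batch_size   # upper estimate of training set size

print(steps_per_epoch, checkpoints, approx_train_sequences)
```

This gives 3,849 steps per epoch and 76 periodic checkpoints, and suggests a training set of a little under a million rallies after filtering.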


Player Style Clustering

cluster_players.py uses the trained model as a feature extractor to cluster players by playing style — no labels required.

Pipeline:

  1. Load all rallies for each player from the dataset
  2. Tokenize and pass each rally through the transformer (excluding the final LM head)
  3. Mean-pool over the sequence dimension to get a fixed-size rally embedding
  4. Average embeddings across all of a player's rallies to get a single player vector
  5. Reduce to 2D with UMAP (n_neighbors=15)
  6. Cluster with k-means (k=5)
  7. Visualize and save as player_clusters.png
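
Steps 3 and 4 of the pipeline reduce to two averaging operations. The sketch below uses plain lists in place of tensors, and it skips padding positions when pooling; that masking and `pad_id=0` are assumptions, since the README does not say whether cluster_players.py excludes padding.

```python
def mean_pool(hidden_states, token_ids, pad_id=0):
    """Average per-token vectors over the non-padding positions of one rally."""
    kept = [vec for vec, tok in zip(hidden_states, token_ids) if tok != pad_id]
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def player_vector(rally_embeddings):
    """Average a player's rally embeddings into a single vector."""
    n = len(rally_embeddings)
    dim = len(rally_embeddings[0])
    return [sum(emb[d] for emb in rally_embeddings) / n for d in range(dim)]


# One rally with two real tokens and one padding position, hidden size 2.
rally = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [5, 6, 0])
print(rally)
print(player_vector([rally, [4.0, 5.0]]))
```

The resulting player vectors would then be passed to the dimensionality-reduction and clustering steps, for instance umap-learn's `UMAP(n_neighbors=15, n_components=2)` followed by scikit-learn's `KMeans(n_clusters=5)`.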

Filters: Players with fewer than 500 rallies are excluded; the top 100 players by rally count are used; players with more than 10,000 rallies are subsampled.
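
These filters could be applied along the following lines. The order of the three steps, the random subsampling down to exactly the cap, and the `rallies_by_player` mapping are assumptions for the sketch rather than details taken from cluster_players.py.

```python
import random


def select_players(rallies_by_player, min_rallies=500, top_n=100, max_rallies=10_000, seed=0):
    """Keep well-sampled players, take the top_n by rally count, cap very large samples."""
    rng = random.Random(seed)
    eligible = {p: r for p, r in rallies_by_player.items() if len(r) >= min_rallies}
    ranked = sorted(eligible, key=lambda p: len(eligible[p]), reverse=True)[:top_n]
    selected = {}
    for player in ranked:
        rallies = eligible[player]
        if len(rallies) > max_rallies:
            rallies = rng.sample(rallies, max_rallies)  # subsample without replacement
        selected[player] = rallies
    return selected


# Small thresholds so the behaviour is visible on toy data.
toy = {"A": list(range(12)), "B": list(range(5)), "C": list(range(2)), "D": list(range(7))}
chosen = select_players(toy, min_rallies=3, top_n=2, max_rallies=10)
print({player: len(rallies) for player, rallies in chosen.items()})
```

On the toy data, player C falls below the minimum, B is outside the top two, and A is subsampled down to the cap.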

This is entirely unsupervised — the clusters emerge from patterns learned during language model training, not from any hand-crafted style labels.


Project Structure

Tennis-Rally-Language-Model/
├── config.py                  # ModelConfig and TrainConfig dataclasses
├── train.py                   # Training entry point
├── cluster_players.py         # Post-training player embedding and clustering
├── utils.py                   # Device selection and rally utilities
├── data/
│   ├── dataset.py             # Dataset class and data loading
│   ├── tokenizer.py           # Custom tennis rally tokenizer
│   └── raw/                   # Raw CSV files from Match Charting Project
├── models/
│   └── rally_transformer.py   # GPT-style transformer model
├── training/
│   └── trainer.py             # Training loop, LR schedule, checkpointing
├── checkpoints/               # Saved model checkpoints
└── tennis_notion_guide.md     # Reference for the shot notation system

Setup

pip install -r requirements.txt

Train the model:

python train.py

Run player clustering (requires a trained checkpoint):

python cluster_players.py

Dependencies

  • Python 3.x
  • PyTorch
  • NumPy
  • pandas
  • scikit-learn
  • umap-learn
  • matplotlib
