A GPT-style transformer trained to model tennis rally sequences as a language, learning to predict the next shot given a sequence of previous shots. The model learns rich representations of player style, enabling downstream analysis like player clustering by playing patterns.
Tennis rallies follow structured patterns — shot type, direction, court position — that can be encoded as discrete tokens, just like words in a sentence. This project applies causal language modeling to tennis rally data to ask: can a model learn the "grammar" of tennis? The learned embeddings are then used to cluster players by playing style without any manually defined features.
Source: Tennis Match Charting Project — a crowd-sourced, open dataset of shot-by-shot match charting.
Three CSV files cover different eras:
| File | Period | Size |
|---|---|---|
| charting-m-points-to-2009.csv | Pre-2010 | ~1.1 MB |
| charting-m-points-2010s.csv | 2010–2019 | ~36 MB |
| charting-m-points-2020s.csv | 2020–present | ~53 MB |
Scale: 1,221,862 points across 7,194 matches spanning from the 1960s through the 2020s.
Each row represents one point and contains a rally encoded in the Match Charting notation, a compact string in which one or more characters encode each shot event. For example, 4f39f3f3d@ reads: serve wide (4) → deep forehand return to the backhand side (f39) → two more forehands to the backhand side (f3, f3), the last of which lands deep for an unforced error (d@).
Preprocessing (data/dataset.py):
- Selects the second serve column when a fault occurred, otherwise the first serve
- Filters out points with no recognized point ending (* winner, @ unforced error, # forced error)
- Drops sequences shorter than 3 tokens
- 95/5 train/validation split
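
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in data/dataset.py: the column names "1st" and "2nd", the helper names, the end-of-string check for the point ending, and the seeded shuffle are all assumptions, and `len` stands in for a real token counter.

```python
import random

POINT_ENDINGS = ("*", "@", "#")  # winner, unforced error, forced error


def select_rally(row):
    """Use the second-serve string when one exists, else the first-serve string.

    Assumes (hypothetically) that rows are dicts with "1st" and "2nd" keys.
    """
    second = (row.get("2nd") or "").strip()
    if second:
        return second
    return (row.get("1st") or "").strip()


def keep_rally(rally, count_tokens, min_tokens=3):
    """Keep rallies that end in a recognized symbol and are long enough."""
    return rally.endswith(POINT_ENDINGS) and count_tokens(rally) >= min_tokens


def split_train_val(items, val_fraction=0.05, seed=0):
    """Shuffle with a fixed seed, then hold out val_fraction for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]


rows = [
    {"1st": "4f39f3f3d@", "2nd": ""},  # first serve in play
    {"1st": "6n", "2nd": "5b28f1*"},   # first serve faulted, second serve used
    {"1st": "4f3", "2nd": ""},         # no recognized point ending, dropped
]
rallies = [select_rally(r) for r in rows]
# len is a character-count stand-in for the real tokenizer here.
kept = [r for r in rallies if keep_rally(r, count_tokens=len)]
```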
A custom rule-based tokenizer (data/tokenizer.py) converts raw rally strings into integer token IDs. Vocabulary size is ~900 tokens.
Token categories:
| Category | Examples | Meaning |
|---|---|---|
| Serve direction | 4, 5, 6 | Wide, body, T |
| Shot type | f, b, v, o, u | Forehand, backhand, volley, smash, drop shot |
| Direction | 1, 2, 3 | Forehand side, middle, backhand side |
| Position | +, -, = | Approach, net, baseline |
| Return depth | 7, 8, 9 | Shallow, moderate, deep |
| Point ending | *, @, # | Winner, unforced error, forced error |
| Error type | n, w, d | Net, wide, deep |
The tokenizer parses left-to-right using a greedy longest-match strategy — e.g., f39 is tokenized as a single return token (forehand, backhand side, deep) rather than three individual characters.
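
A greedy longest-match parser of this kind might look like the sketch below. The tiny vocabulary is hypothetical (the real one has ~900 entries) and the function is an illustration of the strategy, not the code in data/tokenizer.py.

```python
# Hypothetical mini-vocabulary, for illustration only.
VOCAB = {"4", "5", "6", "f", "b", "f3", "f39", "b2", "d", "@", "*", "#"}
MAX_TOKEN_LEN = max(len(token) for token in VOCAB)


def tokenize(rally):
    """Scan left to right, always taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(rally):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(MAX_TOKEN_LEN, len(rally) - i), 0, -1):
            piece = rally[i:i + size]
            if piece in VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"unrecognized symbol at position {i}: {rally[i]!r}")
    return tokens


print(tokenize("4f39f3f3d@"))  # ['4', 'f39', 'f3', 'f3', 'd', '@']
```

Trying the longest candidate first is what keeps f39 together: the three-character match wins before f or f3 are ever considered.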
Each sequence is wrapped with <bos> / <eos> tokens, then truncated or padded to a fixed length of 64.
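
As a rough sketch of that wrapping step, assuming hypothetical special-token IDs (0 for padding, 1 for <bos>, 2 for <eos>) and right-side padding, neither of which is confirmed by the source:

```python
PAD, BOS, EOS = 0, 1, 2  # hypothetical special-token IDs
MAX_LEN = 64


def encode(token_ids, max_len=MAX_LEN):
    """Wrap with BOS/EOS, truncate to max_len, then right-pad with PAD."""
    ids = [BOS] + list(token_ids) + [EOS]
    ids = ids[:max_len]  # note: in this sketch a long rally loses its EOS
    return ids + [PAD] * (max_len - len(ids))


short = encode([5, 6, 7])
long = encode(range(3, 103))  # 100 tokens, longer than MAX_LEN
```

Whether truncation should preserve the final <eos> is a design choice; this sketch simply cuts at the length limit.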
A GPT-2 style decoder-only transformer (models/rally_transformer.py) with causal self-attention.
| Setting | Value |
|---|---|
| Vocab size | 900 |
| Max sequence len | 64 |
| Hidden dim | 128 |
| Layers | 4 |
| Attention heads | 4 |
| FFN dim | 512 |
| Dropout | 0.1 |
| Parameters | ~2.3M |
Key design choices:
- Causal masking: Each position attends only to previous positions, enabling next-token prediction at every step
- Pre-LayerNorm: Normalizes inputs before each sublayer for better training stability
- Weight tying: The output projection shares weights with the token embedding matrix, reducing parameters while keeping input/output semantics aligned
- GPT-2 initialization: Weights drawn from a normal distribution with mean 0 and standard deviation 0.02; residual projections scaled by 1/√(2 × n_layers)
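
Two of these choices reduce to small computations, sketched here in plain Python. The sketch assumes that each position may attend to itself as well as earlier positions (the usual GPT-2 convention) and that the residual scaling multiplies the base standard deviation of 0.02; both are readings of the description above, and the function names are hypothetical.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def residual_init_std(n_layers, base_std=0.02):
    """Base std scaled by 1 / sqrt(2 * n_layers) for residual projections."""
    return base_std / math.sqrt(2 * n_layers)


mask = causal_mask(4)
std = residual_init_std(n_layers=4)  # 0.02 / sqrt(8), roughly 0.00707
```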
Entry point: train.py
Trainer: training/trainer.py
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Max epochs | 20 |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine decay with linear warmup (200 steps) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight decay | 0.1 (weights only; biases/LayerNorm excluded) |
| Gradient clipping | 1.0 |
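
The learning-rate schedule in the table can be written as a small function of the step count. This is a sketch under assumptions: the decay is taken to end at 0 at the final step (76,980), and the function name and exact warmup formula are hypothetical, not taken from training/trainer.py.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=200, total_steps=76_980, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 199, 200, 38_590, 76_980):
    print(step, lr_at(step))
```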
Loss: cross-entropy over next-token prediction, ignoring padding positions.
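
To make the padding mask concrete, here is a dependency-free sketch of the loss for a single sequence. The padding ID of 0 and the function name are assumptions; in PyTorch the same effect is commonly obtained with the `ignore_index` argument of `torch.nn.functional.cross_entropy`.

```python
import math

PAD_ID = 0  # hypothetical padding token ID


def masked_next_token_loss(logits, targets, pad_id=PAD_ID):
    """Mean cross-entropy over positions whose target is not padding.

    logits:  one list of vocabulary scores per position
    targets: the next-token ID for each position
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding positions contribute nothing to the loss
        top = max(scores)
        # Numerically stable log-sum-exp of the scores.
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = masked_next_token_loss(uniform, targets=[1, 2, PAD_ID])  # equals ln(4)
```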
Validation runs every 500 steps; checkpoints saved every 1000 steps with best-model tracking.
Device selection is automatic: CUDA → MPS → CPU.
The model was trained to step 76,980 (full 20 epochs), producing 76 checkpoints.
cluster_players.py uses the trained model as a feature extractor to cluster players by playing style — no labels required.
Pipeline:
- Load all rallies for each player from the dataset
- Tokenize and pass each rally through the transformer (excluding the final LM head)
- Mean-pool over the sequence dimension to get a fixed-size rally embedding
- Average embeddings across all of a player's rallies to get a single player vector
- Reduce to 2D with UMAP (n_neighbors=15)
- Cluster with k-means (k=5)
- Visualize and save as player_clusters.png
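
The two pooling steps in the pipeline above can be illustrated without any dependencies. This sketch assumes that padding positions are left out of the mean and that the padding ID is 0; the source does not say either way, and the helper names are hypothetical. The UMAP and k-means steps are omitted here.

```python
def mean_pool(hidden_states, token_ids, pad_id=0):
    """Average per-position hidden vectors of one rally, skipping padding."""
    kept = [h for h, t in zip(hidden_states, token_ids) if t != pad_id]
    dim = len(hidden_states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def player_vector(rally_embeddings):
    """Average a player's rally embeddings into a single vector."""
    n = len(rally_embeddings)
    dim = len(rally_embeddings[0])
    return [sum(e[d] for e in rally_embeddings) / n for d in range(dim)]


# Toy 2-dimensional hidden states; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
rally_embedding = mean_pool(hidden, token_ids=[5, 6, 0])
player = player_vector([rally_embedding, [4.0, 5.0]])
```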
Filters: Players with fewer than 500 rallies are excluded; the top 100 players by rally count are used; players with more than 10,000 rallies are subsampled.
This is entirely unsupervised — the clusters emerge from patterns learned during language model training, not from any hand-crafted style labels.
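
The player filters described above could be applied along these lines. The order of the steps, the seeded random subsampling, and the function name are assumptions rather than the behavior of cluster_players.py.

```python
import random


def select_players(rallies_by_player, min_rallies=500, top_n=100,
                   max_rallies=10_000, seed=0):
    """Drop low-volume players, keep the top_n by count, subsample the rest."""
    rng = random.Random(seed)
    eligible = {p: r for p, r in rallies_by_player.items()
                if len(r) >= min_rallies}
    ranked = sorted(eligible, key=lambda p: len(eligible[p]), reverse=True)
    selected = {}
    for name in ranked[:top_n]:
        rallies = eligible[name]
        if len(rallies) > max_rallies:
            # Cap very prolific players by sampling without replacement.
            rallies = rng.sample(rallies, max_rallies)
        selected[name] = rallies
    return selected


toy = {"A": ["r"] * 12_000, "B": ["r"] * 600, "C": ["r"] * 100}
chosen = select_players(toy)
```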
Tennis-Rally-Language-Model/
├── config.py # ModelConfig and TrainConfig dataclasses
├── train.py # Training entry point
├── cluster_players.py # Post-training player embedding and clustering
├── utils.py # Device selection and rally utilities
├── data/
│ ├── dataset.py # Dataset class and data loading
│ ├── tokenizer.py # Custom tennis rally tokenizer
│ └── raw/ # Raw CSV files from Match Charting Project
├── models/
│ └── rally_transformer.py # GPT-style transformer model
├── training/
│ └── trainer.py # Training loop, LR schedule, checkpointing
├── checkpoints/ # Saved model checkpoints
└── tennis_notion_guide.md # Reference for the shot notation system
pip install -r requirements.txt

Train the model:

python train.py

Run player clustering (requires a trained checkpoint):

python cluster_players.py

- Python 3.x
- PyTorch
- NumPy
- pandas
- scikit-learn
- umap-learn
- matplotlib