Tennis Rally Language Model

A GPT-style transformer trained to model tennis rally sequences as a language, learning to predict the next shot given a sequence of previous shots. The model learns rich representations of player style, enabling downstream analysis like player clustering by playing patterns.

Motivation

Tennis rallies follow structured patterns — shot type, direction, court position — that can be encoded as discrete tokens, just like words in a sentence. This project applies causal language modeling to tennis rally data to ask: can a model learn the "grammar" of tennis? The learned embeddings are then used to cluster players by playing style without any manually defined features.

Dataset

Source: Tennis Match Charting Project — a crowd-sourced, open dataset of shot-by-shot match charting.

Three CSV files cover different eras:

| File | Period | Size |
| --- | --- | --- |
| `charting-m-points-to-2009.csv` | Pre-2010 | ~1.1 MB |
| `charting-m-points-2010s.csv` | 2010–2019 | ~36 MB |
| `charting-m-points-2020s.csv` | 2020–present | ~53 MB |

Scale: 1,221,862 points across 7,194 matches spanning from the 1960s through the 2020s.

Each row represents one point and contains a rally encoded in the Match Charting notation, a compact string in which each character or short group of characters encodes a shot event. For example, 4f39f3f3d@ means: serve wide → forehand return hit deep to the backhand side → forehand to the backhand side → another forehand to the backhand side that lands long (d) for an unforced error (@).

Preprocessing (data/dataset.py):

Selects the second serve column when a fault occurred, otherwise the first serve
Filters out points with no recognized point ending (* winner, @ unforced error, # forced error)
Drops sequences shorter than 3 tokens
95/5 train/validation split
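
The steps above can be sketched in a few lines. This is a minimal illustration, not the code in data/dataset.py: the function names, the `(first, second)` pair input, the seed, and the rule that an ending marker must be the last character of the string are all assumptions, and `tokenize` stands in for the project's tokenizer.

```python
import random

POINT_ENDINGS = ("*", "@", "#")  # winner, unforced error, forced error


def select_rally(first, second):
    """Prefer the second-serve string when one was charted (first serve was a fault)."""
    if isinstance(second, str) and second.strip():
        return second.strip()
    return first.strip() if isinstance(first, str) else ""


def prepare_rallies(points, tokenize, min_tokens=3, val_frac=0.05, seed=0):
    """points: iterable of (first_serve, second_serve) strings, one pair per point."""
    rallies = [select_rally(first, second) for first, second in points]
    kept = [
        r for r in rallies
        # Assumption: a recognized ending is the final character of the string.
        if r.endswith(POINT_ENDINGS) and len(tokenize(r)) >= min_tokens
    ]
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_frac)
    return kept[n_val:], kept[:n_val]  # (train, validation)
```

Shuffling with a fixed seed before splitting keeps the 95/5 partition reproducible across runs.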

Tokenizer

A custom rule-based tokenizer (data/tokenizer.py) converts raw rally strings into integer token IDs. Vocabulary size is ~900 tokens.

Token categories:

| Category | Examples | Meaning |
| --- | --- | --- |
| Serve direction | `4`, `5`, `6` | Wide, body, T |
| Shot type | `f`, `b`, `v`, `o`, `u` | Forehand, backhand, volley, smash, drop shot |
| Direction | `1`, `2`, `3` | Forehand side, middle, backhand side |
| Position | `+`, `-`, `=` | Approach, net, baseline |
| Return depth | `7`, `8`, `9` | Shallow, moderate, deep |
| Point ending | `*`, `@`, `#` | Winner, unforced error, forced error |
| Error type | `n`, `w`, `d` | Net, wide, deep |

The tokenizer parses left-to-right using a greedy longest-match strategy — e.g., f39 is tokenized as a single return token (forehand, backhand side, deep) rather than three individual characters.
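
Greedy longest-match scanning can be illustrated as follows. This is a sketch under assumptions, not the contents of data/tokenizer.py: `greedy_tokenize`, `TOY_VOCAB`, and the `<unk>` fallback are hypothetical, and the toy vocabulary is far smaller than the real one. How the real tokenizer groups the error-type and point-ending characters is not stated, so here they come out as separate tokens.

```python
def greedy_tokenize(rally, vocab, unk="<unk>"):
    """Scan left to right, always taking the longest vocabulary entry that matches."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(rally):
        for size in range(min(max_len, len(rally) - i), 0, -1):
            piece = rally[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matched: emit an unknown token and skip one character.
            tokens.append(unk)
            i += 1
    return tokens


# Toy vocabulary for illustration only; the real one has ~900 entries.
TOY_VOCAB = {"4", "5", "6", "f", "b", "f3", "f39", "b2", "d", "n", "w", "*", "@", "#"}

print(greedy_tokenize("4f39f3f3d@", TOY_VOCAB))
# ['4', 'f39', 'f3', 'f3', 'd', '@']
```

Because the longest candidate is tried first, `f39` wins over `f3` and `f` whenever all three would match at the same position.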

Each sequence is wrapped with <bos> / <eos> tokens, truncated if it exceeds the maximum length of 64, and padded up to that length otherwise.
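
A minimal sketch of this wrapping step is shown below. The `<pad>` token name, the `encode` helper, the attention mask it returns, and the choice to truncate by cutting the tail (which can drop `<eos>` on very long rallies) are assumptions for illustration.

```python
BOS, EOS, PAD = "<bos>", "<eos>", "<pad>"  # <pad> is an assumed name


def encode(tokens, token_to_id, max_len=64):
    """Wrap with <bos>/<eos>, truncate to max_len, then right-pad with <pad>."""
    ids = [token_to_id[t] for t in [BOS, *tokens, EOS]]
    ids = ids[:max_len]  # assumption: overlong rallies simply lose their tail
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [token_to_id[PAD]] * n_pad, attention_mask
```

The mask marks which positions hold real tokens, which is what lets the loss and any later pooling step skip padding.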

Model Architecture

A GPT-2 style decoder-only transformer (models/rally_transformer.py) with causal self-attention.

Vocab size:       900
Max sequence len: 64
Hidden dim:       128
Layers:           4
Attention heads:  4
FFN dim:          512
Dropout:          0.1
Parameters:       ~2.3M
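
For orientation, the configuration above can be written as a dataclass and its parameter count estimated by hand. The field names and `count_parameters` are hypothetical (config.py may name things differently), and the count assumes a standard GPT-2 style block: biased linear layers, a fused QKV projection, learned position embeddings, two LayerNorms per block plus a final one, and a tied output head.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    vocab_size: int = 900
    max_seq_len: int = 64
    d_model: int = 128
    n_layers: int = 4
    n_heads: int = 4
    d_ff: int = 512
    dropout: float = 0.1


def count_parameters(cfg):
    """Estimate trainable parameters under the assumptions stated above."""
    d, f = cfg.d_model, cfg.d_ff
    embeddings = cfg.vocab_size * d + cfg.max_seq_len * d  # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)          # fused QKV + output projection
    feed_forward = (d * f + f) + (f * d + d)
    layer_norms = 2 * (2 * d)                              # two LayerNorms per block
    block = attention + feed_forward + layer_norms
    final_norm = 2 * d
    return embeddings + cfg.n_layers * block + final_norm  # tied head adds nothing


print(count_parameters(ModelConfig()))  # 916736
```

Under these assumptions the estimate is about 0.92M, lower than the ~2.3M listed above, so the actual model presumably differs from this sketch in some detail not given here.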

Key design choices:

Causal masking: Each position attends only to previous positions, enabling next-token prediction at every step
Pre-LayerNorm: Normalizes inputs before each sublayer for better training stability
Weight tying: The output projection shares weights with the token embedding matrix, reducing parameters while keeping input/output semantics aligned
GPT-2 initialization: Weights drawn from a normal distribution with mean 0 and standard deviation 0.02; residual projections scaled by 1/√(2 × n_layers)
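
Two of these choices reduce to small formulas, sketched here in plain Python rather than PyTorch. The helper names are hypothetical. The factor 2 × n_layers reflects that each block adds to the residual stream twice, once after attention and once after the feed-forward sublayer.

```python
import math


def init_stds(n_layers, base_std=0.02):
    """Std for ordinary weights and for projections that write into the residual stream."""
    return base_std, base_std / math.sqrt(2 * n_layers)


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


base, residual = init_stds(n_layers=4)
print(round(residual, 5))  # 0.00707
for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

In a tensor library the same mask is a lower-triangular boolean matrix applied to the attention scores before the softmax.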

Training

Entry point: train.py
Trainer: training/trainer.py

| Hyperparameter | Value |
| --- | --- |
| Batch size | 256 |
| Max epochs | 20 |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine decay with linear warmup (200 steps) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight decay | 0.1 (weights only; biases/LayerNorm excluded) |
| Gradient clipping | 1.0 |
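
The learning-rate schedule in the table can be sketched as a pure function of the step index. `lr_at` is a hypothetical helper, not the code in training/trainer.py, and the floor `min_lr=0.0` is an assumption since the final learning rate is not stated.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=200, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With 200 warmup steps the rate reaches its peak at step 199, stays there at step 200, and falls to half the peak midway through the remaining steps.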

Loss: cross-entropy over next-token prediction, ignoring padding positions.
Validation runs every 500 steps; checkpoints are saved every 1,000 steps, with best-model tracking.
Device selection is automatic: CUDA → MPS → CPU.

The model was trained to step 76,980 (full 20 epochs), producing 76 checkpoints.
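
These figures are consistent with the settings listed earlier, as a quick check shows. The arithmetic assumes one optimizer step per batch with no gradient accumulation; the variable names are illustrative.

```python
total_steps, epochs, batch_size = 76_980, 20, 256

steps_per_epoch = total_steps // epochs              # 3849
max_train_sequences = steps_per_epoch * batch_size   # 985344
periodic_checkpoints = total_steps // 1000           # 76
validation_runs = total_steps // 500                 # 153

print(steps_per_epoch, max_train_sequences, periodic_checkpoints, validation_runs)
```

Under that assumption the training split holds at most 985,344 sequences, which with a 95/5 split would mean roughly 1.04 million rallies survive preprocessing.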

Player Style Clustering

cluster_players.py uses the trained model as a feature extractor to cluster players by playing style — no labels required.

Pipeline:

Load all rallies for each player from the dataset
Tokenize and pass each rally through the transformer (excluding the final LM head)
Mean-pool over the sequence dimension to get a fixed-size rally embedding
Average embeddings across all of a player's rallies to get a single player vector
Reduce to 2D with UMAP (n_neighbors=15)
Cluster with k-means (k=5)
Visualize and save as player_clusters.png
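
The pooling, averaging, projection, and clustering steps might look like the sketch below. It is not the code in cluster_players.py: the function names and seed are hypothetical, skipping padding positions during mean-pooling is an assumption, and running k-means on the 2D UMAP coordinates (rather than on the raw player vectors) is inferred from the order of the steps.

```python
import numpy as np


def rally_embedding(hidden_states, attention_mask):
    """Mean-pool (seq_len, d_model) hidden states over the non-padding positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (hidden_states * mask).sum(axis=0) / max(mask.sum(), 1.0)


def player_vector(rally_embeddings):
    """Average a player's rally embeddings into one style vector."""
    return np.mean(np.stack(rally_embeddings), axis=0)


def project_and_cluster(player_matrix, seed=0):
    """UMAP to 2D, then k-means with k=5 on the 2D coordinates (order assumed)."""
    import umap  # provided by the umap-learn package
    from sklearn.cluster import KMeans

    reducer = umap.UMAP(n_neighbors=15, n_components=2, random_state=seed)
    coords = reducer.fit_transform(player_matrix)
    labels = KMeans(n_clusters=5, n_init=10, random_state=seed).fit_predict(coords)
    return coords, labels
```

`player_matrix` would be the stacked player vectors, one row per player, so with 100 players and a hidden size of 128 its shape is (100, 128).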

Filters: Players with fewer than 500 rallies are excluded; the top 100 players by rally count are used; players with more than 10,000 rallies are subsampled.
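
A sketch of these filters follows. `select_player_rallies` is a hypothetical helper. Two points are assumptions because the text does not specify them: heavy players are subsampled down to exactly the 10,000 cap, and each rally arrives already attributed to a single player.

```python
import random
from collections import defaultdict


def select_player_rallies(rallies, min_rallies=500, top_n=100, max_rallies=10_000, seed=0):
    """rallies: iterable of (player, rally) pairs. Returns {player: [rally, ...]}."""
    by_player = defaultdict(list)
    for player, rally in rallies:
        by_player[player].append(rally)

    # Drop players with too little data, then keep the most-charted players.
    eligible = {p: r for p, r in by_player.items() if len(r) >= min_rallies}
    top = sorted(eligible, key=lambda p: len(eligible[p]), reverse=True)[:top_n]

    rng = random.Random(seed)
    return {
        # Assumption: subsample down to the cap, without replacement.
        p: rng.sample(eligible[p], max_rallies) if len(eligible[p]) > max_rallies else eligible[p]
        for p in top
    }
```

Capping the most-charted players bounds the cost of the embedding pass without changing which players are included.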

This is entirely unsupervised — the clusters emerge from patterns learned during language model training, not from any hand-crafted style labels.

Project Structure

Tennis-Rally-Language-Model/
├── config.py                  # ModelConfig and TrainConfig dataclasses
├── train.py                   # Training entry point
├── cluster_players.py         # Post-training player embedding and clustering
├── utils.py                   # Device selection and rally utilities
├── data/
│   ├── dataset.py             # Dataset class and data loading
│   ├── tokenizer.py           # Custom tennis rally tokenizer
│   └── raw/                   # Raw CSV files from Match Charting Project
├── models/
│   └── rally_transformer.py   # GPT-style transformer model
├── training/
│   └── trainer.py             # Training loop, LR schedule, checkpointing
├── checkpoints/               # Saved model checkpoints
└── tennis_notion_guide.md     # Reference for the shot notation system

Setup

pip install -r requirements.txt

Train the model:

python train.py

Run player clustering (requires a trained checkpoint):

python cluster_players.py

Dependencies

Python 3.x
PyTorch
NumPy
pandas
scikit-learn
umap-learn
matplotlib
