This repository contains the practical part of a master's thesis focused on the use of Data Science methods in football. The project is divided into separate analytical blocks that are methodologically related, but each of them can also be read on its own.
- `xG/`
  Exploratory analysis of expected-value metrics in football. This section works with `xG`, `xGA`, and `xPts`, compares leagues, identifies long-term overperformance and underperformance, studies Leicester City's 2015/16 Premier League season, and explores team styles through clustering.
- `match_prediction/`
  A prediction-focused section aimed at building the strongest possible workflow for forecasting matches in the current German Bundesliga season. It covers data preparation, feature engineering, machine learning models, a double Poisson approach, market benchmarking, and next-matchday predictions.
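The double Poisson approach mentioned for `match_prediction/` can be illustrated with a minimal sketch. It assumes that home and away goals are independent Poisson variables with known expected goal rates. The function names and the example rates are hypothetical and are not taken from the repository's `src/` code, which may estimate the rates and handle truncation differently.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson rate lam."""
    return exp(-lam) * lam**k / factorial(k)


def outcome_probabilities(lam_home: float, lam_away: float, max_goals: int = 10):
    """Return (home win, draw, away win) probabilities.

    Assumption: the two goal counts are independent, so the probability of a
    scoreline is the product of two Poisson probabilities. Scorelines above
    max_goals are ignored and the result is renormalised.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total


# Hypothetical rates: the home side is expected to score 1.6, the away side 1.1.
p_home, p_draw, p_away = outcome_probabilities(1.6, 1.1)
```

The three probabilities can then be compared with bookmaker-implied probabilities, which is the kind of comparison that market benchmarking refers to.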
Depending on the reader's goal, the project can be approached in several ways:
- For a quick overview, start with this file and then move to the README of the selected section.
- For the methodological interpretation of the `xG` analysis, open `xG/xG_analysis.ipynb` and use `xG/src/` as supporting documentation.
- For the predictive pipeline, start with `match_prediction/README.md`, then continue with `match_prediction/notebooks/README.md` and follow the notebook order.
- For results and visual outputs, use `xG/Plots/` and `match_prediction/outputs/`.
DP/
|-- README.md
|-- xG/
|   |-- README.md
|   |-- xG_analysis.ipynb
|   |-- Data/
|   |-- Plots/
|   `-- src/
`-- match_prediction/
    |-- README.md
    |-- notebooks/
    |-- src/
    |-- data/
    `-- outputs/
- The repository is primarily notebook-driven: the main analytical narrative is developed in `.ipynb` files.
- Shared logic is moved into `src/` modules to keep the work reproducible and reusable.
- Data folders are separated by pipeline stage into `raw`, `interim`, and `processed`.
- Outputs intended for interpretation and presentation are stored separately in `outputs`.
- README files inside subfolders act as local guides that explain what the folder contains, why it exists, and when it matters in the workflow.
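As a rough illustration of the stage-based data layout, the following sketch resolves a stage folder from a data root. The helper name `stage_dir` and the use of `match_prediction/data` as the root are assumptions made for this example; the actual modules in `src/` may organize paths differently.

```python
from pathlib import Path

# Pipeline stages named in this README.
STAGES = ("raw", "interim", "processed")


def stage_dir(data_root: str, stage: str) -> Path:
    """Return the folder for one pipeline stage (hypothetical helper)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}, expected one of {STAGES}")
    return Path(data_root) / stage


# Assumed data root for the prediction section.
raw_dir = stage_dir("match_prediction/data", "raw")
processed_dir = stage_dir("match_prediction/data", "processed")
```

Keeping the stage names in one place means a notebook never writes into an unintended folder by way of a typo.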