Manuscript: A structural lower bound on the applicability-domain gap in materials property prediction. Rohan Fossé and Gaël Pallares, CESI LINEACT, 2026. In preparation for submission to Machine Learning: Science and Technology (MLST), IOP Publishing.
This repository develops a theoretical and empirical study of the applicability-domain problem in materials informatics: the well-documented phenomenon by which a machine-learning model trained on one family of compositions generalises poorly to materials outside that family.
We give the problem its first structural lower bound: under spectral concentration of the property of interest on the composition-similarity graph, the gap between the two ubiquitous encodings of a material — its identity (categorical) and its compositional descriptor vector (compositional) — admits a closed-form lower bound depending only on the graph and the property, not on the regressor. The bound is verified empirically on MatBench, with two cross-domain controls (bike-share station demand, MovieLens user ratings) confirming that the phenomenon is not materials-specific but reflects a general property of graph-supervised learning under leave-node-out splits.
This is one of seven repositories in the cycling-data-lab GitHub organisation.
Note for reviewers from chemistry and materials science. The theoretical framework developed in this paper originated from graph signal processing research on spatial mobility networks, hence the cycling-data-lab GitHub organisation and the bike-share / MovieLens controls in the empirical section. The mathematical bound itself is not mobility-specific: it characterises any graph-supervised regression problem under a leave-node-out evaluation protocol. The present paper develops the materials-informatics instance, which is also the most physically interpretable one: the spectral hypothesis (H1) corresponds exactly to the compositional similarity principle of Curtarolo (2013) and the Hume-Rothery / Goldschmidt / Pettifor structural heuristics of solid-state chemistry. The cross-domain controls confirm that the bound is not an artefact of the materials setting.
Predicting properties of materials (formation energy, band gap,
elastic moduli, thermal conductivity) is one of the most active
areas of computational chemistry and materials science. Two
natural encodings of a material m are:
- Categorical encoding: one-hot indicator of the material identity. This is the default in any fixed-effect specification or matrix-factorisation recommender for materials.
- Compositional encoding: a low-dimensional feature vector summarising composition, structure or local environment (Magpie features, Crystal Graph Convolutional Network embeddings, learned fingerprints).
In practice the two encodings are routinely interchanged: one picks whichever gives the best cross-validation score on the available data. But the choice matters for the applicability-domain question:
Will the trained model also work on a material I have not seen?
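To make the contrast concrete, the following minimal sketch encodes an unseen material both ways. The material names, the two-dimensional descriptor values and the helper functions are all invented for illustration; they are not MatBench data, Magpie features, or code from this repository.

```python
import math

# Hypothetical toy data: names and descriptor values are illustrative only.
train_materials = ["NaCl", "KCl", "MgO"]
held_out = "CaO"  # a material absent from the training set (leave-node-out)

# Made-up 2-d compositional descriptors (stand-ins for Magpie-style features).
descriptors = {
    "NaCl": [2.0, 1.4],
    "KCl": [2.0, 1.6],
    "MgO": [2.4, 1.1],
    "CaO": [2.3, 1.2],
}


def one_hot(material, vocabulary):
    """Categorical encoding: indicator of the material identity."""
    return [1 if material == known else 0 for known in vocabulary]


def nearest_training_material(material):
    """Compositional encoding: locate a material relative to the training set."""
    target = descriptors[material]
    return min(train_materials, key=lambda known: math.dist(target, descriptors[known]))


print(one_hot("KCl", train_materials))      # [0, 1, 0]
print(one_hot(held_out, train_materials))   # [0, 0, 0]
print(nearest_training_material(held_out))  # MgO
```

The categorical code of the held-out material is the zero vector, so a model that relies on it has nothing to condition on, whereas the compositional descriptor still places the material next to a training neighbour.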
The applicability-domain problem has been the subject of an entire methodological literature in QSAR and materials informatics (Tropsha 2010, Sahigara et al. 2012, Hanser et al. 2016). Most existing approaches are empirical: similarity-based confidence scores, k-nearest-neighbour distance to training set, or ensemble-disagreement heuristics. What has been missing is a structural theorem: a quantitative statement that the applicability-domain gap is bounded below by a property of the data, independent of the choice of regressor.
This paper provides one.
v1.0-rc, May 2026 — in preparation for MLST submission. The formal theorem, its proof sketch, and the full empirical validation are now complete. Headline result on the 8-task MatBench v0.1 regression panel:
| Statistic | Value |
|---|---|
| Spearman ρ(R²_spec, ΔR²_LSO) on n = 8 tasks (OLS projection vs measured gap) | +1.000, p_exact = 4.96 × 10⁻⁵ |
| Partial Spearman ρ on chemistry-grounded graph (controlling for log N, log Var(y)) | +0.79, p_exact = 0.032 |
| Partial Spearman ρ on shuffled-kNN null (degree- and clustering-matched) | ≈ +0.25 (d18 in progress) |
| Compositional Information Gain (CIG), 8 tasks | 30× to 150× above null |
| CHGNet vs Magpie encoder discrimination (Δε_K on mp_gap) | −0.13 (24 pp better) |
The remaining shuffled-kNN null run (experiments/d18_shuffled_knn_null.py)
will refine the third row; current numbers come from a topology-matched
Erdős–Rényi null.
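The exact p-value in the first row can be sanity-checked by brute force, since n = 8 tasks give only 8! = 40 320 rank permutations. The sketch below is not the code of d20_exact_permutation_test.py; the function names are hypothetical, the input values are toy numbers rather than the measured R²_spec and ΔR²_LSO, and the two-sided statistic is an assumption. Under that assumption a perfectly monotone relation is matched only by the identity and the full reversal, giving 2/40 320.

```python
from itertools import permutations
from math import factorial


def ranks(values):
    """Ranks 1..n; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_from_ranks(rx, ry):
    """Spearman rho via the tie-free formula 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_spearman_pvalue(x, y):
    """Two-sided exact p-value: enumerate all n! relabellings of y's ranks."""
    rx, ry = ranks(x), ranks(y)
    observed = spearman_from_ranks(rx, ry)
    hits = sum(
        1
        for perm in permutations(ry)
        if abs(spearman_from_ranks(rx, perm)) >= abs(observed) - 1e-12
    )
    return observed, hits / factorial(len(x))


# Toy, perfectly monotone data (hypothetical values, not the paper's).
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [v ** 2 for v in x]
rho, p = exact_spearman_pvalue(x, y)
print(rho, p)  # 1.0 and 2/40320, i.e. about 4.96e-05
```

Full enumeration is feasible only because n is small; for larger panels a Monte Carlo permutation test would be the usual substitute.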
Machine Learning: Science and Technology (MLST), IOP Publishing.
The manuscript is formatted with the official iopjournal.cls
(2024/01/31) and follows the IOP submission guidelines for
LaTeX articles.
materials-applicability-bound/
├── applicability_bound.tex # OFFICIAL MLST draft (v1.0-rc, iopjournal class)
├── iopjournal.cls # IOP class file (from publishingsupport.iopscience.iop.org)
├── .zenodo.json # Zenodo deposit metadata (auto-picked by Zenodo-GitHub bridge)
├── CITATION.cff # Citation File Format (rendered as "Cite this repository" on GitHub)
├── references/references.bib # BibTeX
├── experiments/ # d01-d24 numbered, reproducible
│ ├── _plot_style.py # Publication-quality plot helper (Paul Tol palette)
│ ├── d11_predictive_multitask.py # MatBench panel (LightGBM, 5-fold GroupKFold)
│ ├── d11b_expt_gap.py # 8th task: matbench_expt_gap (experimental band gap)
│ ├── d13_spectrum_predictive.py # Falsifiability statistics (chem vs ER null)
│ ├── d13_bootstrap_ci.py # Partial Spearman bootstrap CI (B = 10⁴)
│ ├── d18_shuffled_knn_null.py # Degree- and clustering-matched null (10 realisations × 8 tasks)
│ ├── d19_cig_figure.py # CIG bars figure
│ ├── d20_exact_permutation_test.py # Exact n! permutation p-value (n = 8 → 40 320)
│ ├── d24_chgnet_structural_eg.py # CHGNet encoder discrimination (foundation-model baseline)
│ └── ...
├── figures/ # Paper figures (publication-quality)
├── outputs/ # Per-experiment JSON / NPZ
├── drafts/ # Earlier drafts + supporting fragments
│ ├── applicability_bound_v0p3_long_form.tex # Pre-MLST Elsevier-style v0.3 (archived)
│ ├── mlst_abstract_v3.tex # MLST-style abstract draft (~245 words)
│ └── mlst_editorial_blocks.tex # Long-form statements (Reproducibility, Data, etc.)
├── RESEARCH_NOTES.md # Cross-machine sync notes (NOT for submission; delete before tag)
├── LICENSE # MIT
└── README.md
# Build the manuscript (figures must be in figures/, iopjournal.cls in the same dir)
pdflatex applicability_bound.tex
bibtex applicability_bound
pdflatex applicability_bound.tex
pdflatex applicability_bound.tex
# Re-run a numbered experiment (example)
python3.12 experiments/d11_predictive_multitask.py
python3.12 experiments/d13_spectrum_predictive.py
python3.12 experiments/d20_exact_permutation_test.py

Master random seed is pinned to SEED = 42 in every script that
contains a stochastic step. Heavy caches (Magpie features,
Fourier coefficients) are NOT committed; they are deterministically
regenerated by the corresponding script on first run.
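
The pattern described here (a pinned seed plus caches that are rebuilt on demand) can be sketched as follows. Only the value SEED = 42 comes from this README; the helper load_or_regenerate, the toy_features function and the JSON cache format are hypothetical and are not taken from the experiment scripts.

```python
import json
import random
import tempfile
from pathlib import Path

SEED = 42  # value stated in this README


def load_or_regenerate(cache_path, compute):
    """Return the cached result if present, otherwise compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result))
    return result


def toy_features():
    """Stand-in for an expensive stochastic step; uses a local seeded generator."""
    rng = random.Random(SEED)
    return [rng.random() for _ in range(4)]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_features.json"
    first = load_or_regenerate(cache, toy_features)   # computed and written
    second = load_or_regenerate(cache, toy_features)  # read back from disk
    cache.unlink()                                    # simulate a fresh clone
    third = load_or_regenerate(cache, toy_features)   # regenerated from the seed
    print(first == second == third)  # True
```

Using a local random.Random(SEED) instance rather than the global generator keeps the regenerated values independent of whatever other code ran before the step.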
The v1.0-rc.3 release is archived on Zenodo with DOI
10.5281/zenodo.20355996.
Subsequent versioned releases (v1.0-rc.4, v1.0) are minted
automatically under the same Zenodo concept DOI via the
GitHub-Zenodo bridge enabled on this repository. The
.zenodo.json at the repository root carries the
deposit metadata.
A machine-readable citation is provided in
CITATION.cff (GitHub renders this as a
"Cite this repository" button). Plain BibTeX:
@unpublished{FossePallares2026applicabilityBound,
author = {Foss\'e, Rohan and Pallares, Ga\"el},
title = {A structural lower bound on the applicability-domain
gap in materials property prediction},
note = {Manuscript in preparation, CESI LINEACT, 2026.
\url{https://github.com/cycling-data-lab/materials-applicability-bound}},
year = {2026}
}

- bikeshare-demand-forecasting: original empirical anchor (G vs G_FE LSO finding on bike-share stations); now a cross-domain control.
- bikeshare-gsp-tools: GSP toolkit; this paper sharpens Theorem 3 of its Mathematical foundations.
- imd-national-catalogue: IMD-4 features used as the compositional encoder in the bike-share control.
MIT.
- Rohan Fossé (rfosse@cesi.fr), ORCID
- Gaël Pallares, ORCID