materials-applicability-bound

License: MIT | Python 3.12+ | Status: v1.0-rc (MLST submission) | DOI

Manuscript: A structural lower bound on the applicability-domain gap in materials property prediction. Rohan Fossé and Gaël Pallares, CESI LINEACT, 2026. In preparation for submission to Machine Learning: Science and Technology (MLST), IOP Publishing.

This repository develops a theoretical and empirical study of the applicability-domain problem in materials informatics: the well-documented phenomenon by which a machine-learning model trained on one family of compositions generalises poorly to materials outside that family.

We give the problem its first structural lower bound: under spectral concentration of the property of interest on the composition-similarity graph, the gap between the two ubiquitous encodings of a material — its identity (categorical) and its compositional descriptor vector (compositional) — admits a closed-form lower bound depending only on the graph and the property, not on the regressor. The bound is verified empirically on MatBench, with two cross-domain controls (bike-share station demand, MovieLens user ratings) confirming that the phenomenon is not materials-specific but reflects a general property of graph-supervised learning under leave-node-out splits.
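As a rough illustration of what spectral concentration on a similarity graph means, the sketch below builds a toy graph, takes the eigenvectors of its combinatorial Laplacian, and measures the fraction of a property's variance captured by an OLS projection onto the K lowest-frequency eigenvectors. The graph, the two property vectors, the cutoff K and the function name `spectral_r2` are assumptions made for illustration only; the manuscript's exact definition of the spectral quantity may differ.

```python
import numpy as np

def spectral_r2(W, y, K):
    """Fraction of Var(y) captured by the K lowest-frequency Laplacian eigenvectors.

    W : (n, n) symmetric non-negative similarity matrix (hypothetical input).
    y : (n,) property values, one per node.
    K : number of low-frequency eigenvectors kept (the constant one included).
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    L = np.diag(W.sum(axis=1)) - W      # combinatorial Laplacian L = D - W
    _, U = np.linalg.eigh(L)            # eigenvectors, ascending eigenvalue order
    U_K = U[:, :K]                      # orthonormal low-frequency basis
    y_hat = U_K @ (U_K.T @ y)           # OLS projection onto span(U_K)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy graph: two fully connected clusters of 4 nodes joined by one weak edge.
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.1

smooth = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])  # constant within clusters
rough = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])   # oscillates within clusters

r2_smooth = spectral_r2(W, smooth, K=2)
r2_rough = spectral_r2(W, rough, K=2)
print(f"smooth: {r2_smooth:.3f}, rough: {r2_rough:.3f}")
```

On this toy graph the cluster-wise constant property is almost entirely captured by two eigenvectors, while the oscillating one is almost entirely missed. That contrast is the sense in which a property can be concentrated, or not, on the low end of the graph spectrum.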

This is one of seven repositories in the cycling-data-lab GitHub organisation.

Note for reviewers from chemistry and materials science. The theoretical framework developed in this paper originated from graph signal processing research on spatial mobility networks, hence the cycling-data-lab GitHub organisation and the bike-share / MovieLens controls in the empirical section. The mathematical bound itself is not mobility-specific: it characterises any graph-supervised regression problem under a leave-node-out evaluation protocol. The present paper develops the materials-informatics instance, which is also the most physically interpretable one: the spectral hypothesis (H1) corresponds exactly to the compositional similarity principle of Curtarolo (2013) and the Hume-Rothery / Goldschmidt / Pettifor structural heuristics of solid-state chemistry. The cross-domain controls confirm that the bound is not an artefact of the materials setting.

The puzzle

Predicting properties of materials (formation energy, band gap, elastic moduli, thermal conductivity) is one of the most active areas of computational chemistry and materials science. Two natural encodings of a material m are:

  • Categorical encoding: one-hot indicator of the material identity. This is the default in any fixed-effect specification or matrix-factorisation recommender for materials.
  • Compositional encoding: a low-dimensional feature vector summarising composition, structure or local environment (Magpie features, Crystal Graph Convolutional Network embeddings, learned fingerprints).

In practice the two encodings are routinely interchanged: one picks whichever gives the best cross-validation score on the available data. But the choice matters for the applicability-domain question:

Will the trained model also work on a material I have not seen?
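To make the question concrete, the sketch below contrasts the two encodings when one material is held out of training. The material identifiers, the two-component descriptor values and the helper name `one_hot` are hypothetical, chosen only to show the mechanism; they are not taken from MatBench or from the repository.

```python
import numpy as np

# Hypothetical toy data: four materials with made-up 2-d descriptors.
material_ids = ["m0", "m1", "m2", "m3"]
descriptor = np.array([
    [0.10, 1.00],
    [0.20, 1.10],
    [0.80, 0.30],
    [0.12, 1.02],   # m3 resembles m0
])

train_ids = ["m0", "m1", "m2"]   # materials seen during training
test_id = "m3"                   # held-out material (leave-material-out)

def one_hot(material, vocabulary):
    """Categorical encoding: indicator of the material identity within `vocabulary`."""
    return np.array([1.0 if material == v else 0.0 for v in vocabulary])

# Categorical: the vocabulary is fixed by the training set, so an unseen
# material maps to the all-zero vector and carries no information.
x_cat_test = one_hot(test_id, train_ids)

# Compositional: the held-out material still gets a meaningful vector,
# and its nearest training neighbour can be identified.
x_comp_test = descriptor[material_ids.index(test_id)]
train_rows = np.array([descriptor[material_ids.index(m)] for m in train_ids])
distances = np.linalg.norm(train_rows - x_comp_test, axis=1)
nearest = train_ids[int(np.argmin(distances))]

print("categorical:", x_cat_test)
print("compositional:", x_comp_test, "nearest training material:", nearest)
```

The categorical vector for the unseen material is identically zero, so any regressor fitted on it can only fall back on a global default. The compositional vector places the same material next to a training neighbour it can borrow from.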

The applicability-domain problem has been the subject of an entire methodological literature in QSAR and materials informatics (Tropsha 2010, Sahigara et al. 2012, Hanser et al. 2016). Most existing approaches are empirical: similarity-based confidence scores, k-nearest-neighbour distance to training set, or ensemble-disagreement heuristics. What has been missing is a structural theorem: a quantitative statement that the applicability-domain gap is bounded below by a property of the data, independent of the choice of regressor.

This paper provides one.

Status and headline result

v1.0-rc, May 2026 — in preparation for MLST submission. The formal theorem, its proof sketch, and the full empirical validation are now complete. Headline result on the 8-task MatBench v0.1 regression panel:

| Statistic | Value |
|---|---|
| Spearman ρ(R²_spec, ΔR²_LSO) on n = 8 tasks (OLS projection vs measured gap) | +1.000, p_exact = 4.96 × 10⁻⁵ |
| Partial Spearman ρ on chemistry-grounded graph (controlling for log N, log Var(y)) | +0.79, p_exact = 0.032 |
| Partial Spearman ρ on shuffled-kNN null (degree- and clustering-matched) | ≈ +0.25 (d18 in progress) |
| Compositional Information Gain (CIG), 8 tasks | 30× to 150× above null |
| CHGNet vs Magpie encoder discrimination (Δε_K on mp_gap) | −0.13 (24 pp better) |

The third row is provisional: the shuffled-kNN null run (experiments/d18_shuffled_knn_null.py) is still in progress, and the current value comes from a topology-matched Erdős–Rényi null.
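The exact p-value in the first row can be sanity-checked by brute force. With n = 8 tie-free points there are 8! = 40 320 rank permutations, and only the identity and the full reversal reach |ρ| = 1, which gives 2 / 40 320 ≈ 4.96 × 10⁻⁵ for a two-sided test. The sketch below is not the repository's d20 script; the function names and the toy data are assumptions, and it only covers the tie-free case.

```python
from itertools import permutations
from math import factorial

def ranks(values):
    """Ranks 1..n of a tie-free sequence."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_from_ranks(rx, ry):
    """Spearman rho for tie-free ranks: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def exact_spearman_pvalue(x, y):
    """Two-sided exact p-value: share of the n! rank permutations with |rho| >= |rho_obs|."""
    rx, ry = ranks(x), ranks(y)
    rho_obs = spearman_from_ranks(rx, ry)
    hits = sum(
        abs(spearman_from_ranks(rx, perm)) >= abs(rho_obs) - 1e-12
        for perm in permutations(ry)
    )
    return rho_obs, hits / factorial(len(x))

# Toy, perfectly monotone data on n = 8 points (values are illustrative only).
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
rho, p = exact_spearman_pvalue(x, y)
print(f"rho = {rho:+.3f}, p_exact = {p:.3g}")   # rho = +1.000, p_exact = 4.96e-05
```

Full enumeration is cheap at n = 8 and avoids the asymptotic approximations that are unreliable for such a small panel.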

Target venue (locked)

Machine Learning: Science and Technology (MLST), IOP Publishing. The manuscript is formatted with the official iopjournal.cls (2024/01/31) and follows the IOP submission guidelines for LaTeX articles.

What's in here

materials-applicability-bound/
├── applicability_bound.tex             # OFFICIAL MLST draft (v1.0-rc, iopjournal class)
├── iopjournal.cls                      # IOP class file (from publishingsupport.iopscience.iop.org)
├── .zenodo.json                        # Zenodo deposit metadata (auto-picked by Zenodo-GitHub bridge)
├── CITATION.cff                        # Citation File Format (rendered as "Cite this repository" on GitHub)
├── references/references.bib           # BibTeX
├── experiments/                        # d01-d24 numbered, reproducible
│   ├── _plot_style.py                    # Publication-quality plot helper (Paul Tol palette)
│   ├── d11_predictive_multitask.py       # MatBench panel (LightGBM, 5-fold GroupKFold)
│   ├── d11b_expt_gap.py                  # 8th task: matbench_expt_gap (experimental band gap)
│   ├── d13_spectrum_predictive.py        # Falsifiability statistics (chem vs ER null)
│   ├── d13_bootstrap_ci.py               # Partial Spearman bootstrap CI (B = 10⁴)
│   ├── d18_shuffled_knn_null.py          # Degree- and clustering-matched null (10 realisations × 8 tasks)
│   ├── d19_cig_figure.py                 # CIG bars figure
│   ├── d20_exact_permutation_test.py     # Exact n! permutation p-value (n = 8 → 40 320)
│   ├── d24_chgnet_structural_eg.py       # CHGNet encoder discrimination (foundation-model baseline)
│   └── ...
├── figures/                            # Paper figures (publication-quality)
├── outputs/                            # Per-experiment JSON / NPZ
├── drafts/                             # Earlier drafts + supporting fragments
│   ├── applicability_bound_v0p3_long_form.tex   # Pre-MLST Elsevier-style v0.3 (archived)
│   ├── mlst_abstract_v3.tex                     # MLST-style abstract draft (~245 words)
│   └── mlst_editorial_blocks.tex                # Long-form statements (Reproducibility, Data, etc.)
├── RESEARCH_NOTES.md                   # Cross-machine sync notes (NOT for submission; delete before tag)
├── LICENSE                             # MIT
└── README.md

Reproducing the paper

# Build the manuscript (figures must be in figures/, iopjournal.cls in the same dir)
pdflatex applicability_bound.tex
bibtex   applicability_bound
pdflatex applicability_bound.tex
pdflatex applicability_bound.tex

# Re-run a numbered experiment (example)
python3.12 experiments/d11_predictive_multitask.py
python3.12 experiments/d13_spectrum_predictive.py
python3.12 experiments/d20_exact_permutation_test.py

The master random seed is pinned to SEED = 42 in every script that contains a stochastic step. Heavy caches (Magpie features, Fourier coefficients) are not committed; the corresponding script regenerates them deterministically on first run.
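The seed-plus-cache pattern described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `load_or_compute`, `toy_features` and the cache file name are hypothetical, and the random draw stands in for an expensive feature computation.

```python
import os
import tempfile
import numpy as np

SEED = 42  # master seed, as stated above

def load_or_compute(path, compute):
    """Return the cached array at `path`, regenerating it with `compute()` if absent."""
    if os.path.exists(path):
        return np.load(path)
    array = compute()
    np.save(path, array)
    return array

def toy_features():
    """Stand-in for an expensive, stochastic feature computation (hypothetical)."""
    rng = np.random.default_rng(SEED)   # fresh generator, so every call gives the same draw
    return rng.normal(size=(5, 3))

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "toy_features.npy")

first = load_or_compute(cache_path, toy_features)    # computed and written to disk
second = load_or_compute(cache_path, toy_features)   # read back from the cache
fresh = toy_features()                               # recomputed from scratch

print(np.array_equal(first, second), np.array_equal(first, fresh))   # True True
```

Because the generator is re-seeded inside the compute function, deleting the cache and re-running yields the same array, which is what makes it safe to leave the cache out of version control.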

Zenodo

The v1.0-rc.3 release is archived on Zenodo with DOI 10.5281/zenodo.20355996. Subsequent versioned releases (v1.0-rc.4, v1.0) are minted automatically under the same Zenodo concept DOI via the GitHub-Zenodo bridge enabled on this repository. The .zenodo.json at the repository root carries the deposit metadata.

How to cite

A machine-readable citation is provided in CITATION.cff (GitHub renders this as a "Cite this repository" button). Plain BibTeX:

@unpublished{FossePallares2026applicabilityBound,
  author = {Foss\'e, Rohan and Pallares, Ga\"el},
  title  = {A structural lower bound on the applicability-domain
            gap in materials property prediction},
  note   = {Manuscript in preparation, CESI LINEACT, 2026.
            \url{https://github.com/cycling-data-lab/materials-applicability-bound}},
  year   = {2026}
}

Sibling repos

License

MIT.

Contact

Rohan Fossé: rfosse@cesi.fr | ORCID
Gaël Pallares: ORCID
