CosmOrford

How can we build optimal summary statistics for weak gravitational lensing cosmology under a limited simulation budget?

This repository investigates that question and distills lessons learned from participating in the FAIR Universe - Weak Lensing ML Uncertainty Challenge.

We compare different strategies for building summary statistics — analytical, neural without pre-training, and neural with pre-training on cheaper simulations — within a unified evaluation framework.

📐 Evaluation framework

All summary strategies are evaluated through the same three-step pipeline, which ensures a fair comparison across approaches.

Step 1 — Compression to 8D. Every summary (analytical or neural) is compressed into an 8-dimensional vector. This shared dimensionality puts all approaches on equal footing for the downstream posterior estimation.

Step 2 — Neural Posterior Estimation (NPE). A Masked Autoregressive Flow (MAF) is trained on (summary, θ) pairs drawn from the holdout dataset — with noise augmentation applied to the maps before compression — to approximate the posterior p(Ω_m, S_8 | summary).

Step 3 — Figure of Merit (FoM). Posterior samples are drawn for maps from the fiducial split of the holdout dataset (Ω_m = 0.29, S_8 = 0.81). The FoM = 1 / sqrt(det Cov(Ω_m, S_8)) measures how tightly the posterior constrains the parameters.
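The FoM in Step 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `scripts/run_npe_budget_scan.py`: the function name `figure_of_merit` and the toy Gaussian samples standing in for NPE posterior samples are assumptions made for the example.

```python
import numpy as np

def figure_of_merit(samples: np.ndarray) -> float:
    """FoM = 1 / sqrt(det Cov) for samples of shape (n_samples, 2)."""
    # 2x2 sample covariance of (Omega_m, S_8); columns are the parameters.
    cov = np.cov(samples, rowvar=False)
    return float(1.0 / np.sqrt(np.linalg.det(cov)))

# Toy stand-in for posterior samples (assumption): a correlated Gaussian
# centred on the fiducial cosmology quoted above.
rng = np.random.default_rng(0)
true_cov = np.array([[4e-4, -1e-4],
                     [-1e-4, 1e-4]])
samples = rng.multivariate_normal([0.29, 0.81], true_cov, size=50_000)

fom = figure_of_merit(samples)
expected = 1.0 / np.sqrt(np.linalg.det(true_cov))
print(f"FoM from samples: {fom:.0f}, from the true covariance: {expected:.0f}")
```

Since the determinant of a 2x2 covariance scales with the squared area of the posterior ellipse, a higher FoM means a tighter joint constraint: halving both posterior widths multiplies the FoM by four.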

Scripts:

| Script | Description |
| --- | --- |
| `cosmoford/models_nopatch.py` | Compressor model, trained via `cosmoford/trainer.py` |
| `scripts/run_npe_budget_scan.py` | Trains the NPE flow and computes FoM, sweeping over simulation budgets |
| `scripts/plot_fom_budget.py` | Plots FoM vs. simulation budget from saved results |

Datasets:

| Dataset | Split | Used for |
| --- | --- | --- |
| `CosmoStat/neurips-wl-challenge-flat` | `train` / `validation` | Compressor training and validation |
| `CosmoStat/neurips-wl-challenge-holdout` | `train` | NPE training (summaries precomputed with noise augmentation) |
| `CosmoStat/neurips-wl-challenge-holdout` | `fiducial` | FoM evaluation |

⚠️ See below for how to access the datasets.

🗂️ Summary statistics strategies

🔢 Option A — Analytical summaries

Physically motivated statistics computed directly from the masked convergence maps, such as peak counts, wavelet ℓ₁-norm, or power spectrum. A small MLP is then trained to compress these hand-crafted features into an 8D vector by maximizing a Gaussian log-likelihood.
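As an illustration of one such hand-crafted statistic, the sketch below counts peaks (pixels larger than their 8 neighbours) in bins of peak height. It is a simplified example, not the repository's implementation: the function name `peak_counts`, the bin edges, and the treatment of the mask (only the peak pixel itself must be unmasked) are assumptions.

```python
import numpy as np

def peak_counts(kappa: np.ndarray, mask: np.ndarray, bin_edges: np.ndarray) -> np.ndarray:
    """Histogram of local maxima of `kappa`, skipping pixels where `mask` is False."""
    # Pad with -inf so border pixels can be compared against all 8 neighbours.
    padded = np.pad(kappa, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = np.ones_like(kappa, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy : padded.shape[0] - 1 + dy,
                               1 + dx : padded.shape[1] - 1 + dx]
            is_peak &= center > neighbour  # strict maximum
    is_peak &= mask  # simplification: only the peak pixel must be unmasked
    counts, _ = np.histogram(kappa[is_peak], bins=bin_edges)
    return counts

# Toy 5x5 map with two isolated peaks of different heights.
kappa = np.zeros((5, 5))
kappa[1, 1] = 0.5
kappa[3, 3] = 1.5
mask = np.ones_like(kappa, dtype=bool)
bin_edges = np.array([0.25, 1.0, 2.0])

counts = peak_counts(kappa, mask, bin_edges)
print(counts)
```

The resulting vector of counts per bin is the kind of feature vector that the MLP would then compress to 8 dimensions.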

Training script: `trainer fit -c <config TBD>`

Dataset: `CosmoStat/neurips-wl-challenge-flat`

🧠 Option B — Neural compressor (no pre-training)

An EfficientNetV2-S network trained directly on the N-body simulations, compressing each convergence map to 8 summary statistics by maximizing the Gaussian log-likelihood.
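The Gaussian log-likelihood objective can be sketched as follows. This is a hedged illustration only: it assumes a diagonal Gaussian over the parameters (Ω_m, S_8) whose mean and log-variance are predicted by the network, and it uses NumPy rather than the PyTorch training code for brevity. How the 8 outputs map to these quantities in `cosmoford` is not specified here, and the name `gaussian_nll` is hypothetical.

```python
import numpy as np

def gaussian_nll(theta: np.ndarray, mean: np.ndarray, log_var: np.ndarray) -> float:
    """Batch-averaged negative log-likelihood of theta under N(mean, diag(exp(log_var)))."""
    var = np.exp(log_var)
    per_dim = 0.5 * (np.log(2.0 * np.pi) + log_var + (theta - mean) ** 2 / var)
    return float(per_dim.sum(axis=-1).mean())

# Two toy parameter vectors (Omega_m, S_8).
theta = np.array([[0.29, 0.81],
                  [0.31, 0.78]])

# A perfect mean prediction versus one that is off by 0.5 in each parameter.
good = gaussian_nll(theta, mean=theta, log_var=np.zeros_like(theta))
bad = gaussian_nll(theta, mean=theta + 0.5, log_var=np.zeros_like(theta))
print(good, bad)
```

Minimizing this quantity is equivalent to maximizing the Gaussian log-likelihood; the log-variance term keeps the network from lowering the loss simply by predicting very large uncertainties.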

Training script: `trainer fit -c configs/experiments/efficientnet_v2_s_logp_.yaml`

Dataset: `CosmoStat/neurips-wl-challenge-flat`

🚀 Option C — Neural compressor with pre-training

Same EfficientNetV2-S architecture, but first pre-trained on a larger set of cheaper simulations to reduce overfitting when the N-body budget is small, then fine-tuned on the N-body dataset. The compressor is trained with a Gaussian log-likelihood loss.

Fine-tuning script: `trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml`

Update `pretrained_checkpoint_path` in the config to point to your pre-trained checkpoint.

Fine-tuning dataset: `CosmoStat/neurips-wl-challenge-flat`

Available pre-training datasets and their configs:

| Simulation type | Local dataset | Pre-training config |
| --- | --- | --- |
| Gaussian Random Field (GRF) | `CosmoStat/GRF_HF` | `None` |
| LogNormal | `CosmoStat/lognormal` | `configs/experiments/pretrain_lognormal_nopatch_logp.yaml` |
| Gower Street | `CosmoStat/gowerstreet-train` | `configs/experiments/pretrain_gowerstreet_nopatch_logp.yaml` |
| OT-emulated (from LogNormal) | `CosmoStat/ot_emulated` | `configs/pretrain_otemulated_nopatch_logp.yaml` |
| OT-emulated from TBD | output of the emulator (see below) | TBD |

# Example: pre-train on LogNormal, then fine-tune on challenge data
trainer fit -c configs/experiments/pretrain_lognormal_nopatch_logp.yaml
trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml

⚙️ Building the OT-emulated dataset

To bridge the gap between cheap simulations and the N-body distribution, a UNet emulator is trained using conditional optimal-transport flow matching (COT-FM). It maps LogNormal (or Gower Street) convergence maps to the distribution of N-body maps, conditioned on cosmological parameters. The emulated maps are then used as pre-training data for Option C.
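The two ingredients of such a training step, pairing cheap and N-body maps within a mini-batch and building the flow-matching regression target, can be sketched as below. This is an illustration under stated assumptions, not the contents of `cosmoford/emulator/cot_fm.py`: the function names, the `cond_weight` penalty on parameter mismatch, and the brute-force search over permutations (only viable for tiny batches; a real implementation would use a proper OT solver) are all hypothetical.

```python
import itertools
import numpy as np

def ot_pairing(x0, x1, c0, c1, cond_weight=10.0):
    """Permutation `perm` minimising sum_i cost(x0[i], x1[perm[i]]) by brute force."""
    n = len(x0)
    # Squared distance between flattened maps ...
    cost = ((x0.reshape(n, 1, -1) - x1.reshape(1, n, -1)) ** 2).sum(-1)
    # ... plus a penalty for pairing maps with different cosmological parameters.
    cost = cost + cond_weight * ((c0[:, None, :] - c1[None, :, :]) ** 2).sum(-1)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return np.array(best)

def flow_matching_targets(x0, x1, t):
    """Straight-line interpolant x_t between paired maps and its velocity x1 - x0."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over map pixels
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))    # toy "cheap" maps
c0 = rng.uniform(size=(4, 2))      # their (Omega_m, S_8)
shuffle = np.array([2, 0, 3, 1])
x1 = x0[shuffle] + 0.01 * rng.normal(size=(4, 8, 8))  # toy "N-body" maps
c1 = c0[shuffle]

perm = ot_pairing(x0, x1, c0, c1)
t = rng.uniform(size=4)
x_t, v_target = flow_matching_targets(x0, x1[perm], t)
```

In a training loop, the UNet would take `x_t`, `t`, and the cosmological parameters as input and be regressed onto `v_target`; integrating the learned velocity field from t = 0 to t = 1 then transports a cheap map toward the N-body distribution.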

Training script: `cosmoford/emulator/cot_fm.py`

UNet configs: `configs/unet_condition_small.yaml` / `configs/unet_condition_large.yaml`

Build HF dataset from emulated maps: `scripts/hf_emulated_dataset.py`

| Dataset | Role |
| --- | --- |
| `CosmoStat/GRF_HF` | Cheap simulations to be corrected (GRF) |
| `CosmoStat/lognormal` | Cheap simulations to be corrected (LogNormal) |
| PM source | To be generated |
| `CosmoStat/neurips-wl-challenge-flat` | N-body target distribution for the emulator |

python cosmoford/emulator/cot_fm.py \
    --config_yaml configs/unet_condition_large.yaml \
    --dataset_dir_nbody <path/to/neurips-wl-challenge-flat> \
    --dataset_dir_logn_train <path/to/GRF_HF> \
    --num_epochs 100

# Build the emulated HF dataset
python scripts/hf_emulated_dataset.py

🔧 Installation

pip install -e .

Requires Python ≥ 3.8. Key dependencies: torch, lightning, diffusers, torchdyn, nflows, datasets, wandb.

📦 Dataset loading

By default, datasets are loaded locally from /project/rrg-lplevass/shared/wl_chall_data/ (on the Rorqual cluster). The expected directory structure is:

/project/rrg-lplevass/shared/wl_chall_data/
├── neurips-wl-challenge-flat/   # Main challenge dataset (train/validation splits)
├── lognormal/                   # LogNormal pretraining data
├── gowerstreet-train/           # Gower Street pretraining data
├── ot_emulated/                 # OT-emulated pretraining data
└── GRF_HF/                      # Gaussian Random Field pretraining data

To load from the Hugging Face Hub / GCS instead (e.g. when running outside the cluster), set `use_hub: true` in your config:

data:
  init_args:
    use_hub: true

To use a different local directory, set data_dir:

data:
  init_args:
    data_dir: /path/to/your/datasets

All options can also be passed as CLI overrides:

# Default: just pick a dataset mode (loads locally from the default path)
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.dataset_mode=lognormal

# Load from HuggingFace Hub
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.use_hub=true

# Load from a custom local path
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.data_dir=/scratch/datasets

Available dataset_mode values: train, full, lognormal, gowerstreet, gowerstreet-train, ot_emulated, grf.

👥 Team


@AndreasTersenov	@ASKabalan	@b-remy
@EiffL	@noe-dia	@JuliaLinhart
@Justinezgh	@LaurencePeanuts	@SammyS15
@sachaguer	@rouzib

📝 License

See LICENSE file for details.
