Ciela-Institute/CosmOrford

CosmOrford

How to build optimal summary statistics for weak gravitational lensing cosmology under a limited simulation budget?

License: MIT

This repository investigates how to build optimal summary statistics for weak gravitational lensing cosmology under a limited simulation budget. This work distills lessons learned from participating in the FAIR Universe - Weak Lensing ML Uncertainty Challenge.

We compare different strategies for building summary statistics — analytical, neural without pre-training, and neural with pre-training on cheaper simulations — within a unified evaluation framework.


📐 Evaluation framework

All summary strategies are evaluated through the same three-step pipeline, which ensures a fair comparison across approaches.

Step 1 — Compression to 8D. Every summary (analytical or neural) is compressed into an 8-dimensional vector. This shared dimensionality puts all approaches on equal footing for the downstream posterior estimation.

Step 2 — Neural Posterior Estimation (NPE). A Masked Autoregressive Flow (MAF) is trained on (summary, θ) pairs drawn from the holdout dataset — with noise augmentation applied to the maps before compression — to approximate the posterior p(Ω_m, S_8 | summary).

Step 3 — Figure of Merit (FoM). Posterior samples are drawn for maps from the fiducial split of the holdout dataset (Ω_m = 0.29, S_8 = 0.81). The figure of merit, FoM = 1 / sqrt(det Cov(Ω_m, S_8)), measures how tightly the posterior constrains the two parameters: a larger FoM means a smaller posterior area and therefore a tighter constraint.
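As a minimal sketch of the FoM formula above, the snippet below computes 1 / sqrt(det Cov) from an array of posterior samples. The function name `figure_of_merit` and the synthetic Gaussian samples are assumptions made for illustration; in the actual pipeline the samples would come from the trained NPE flow.

```python
import numpy as np


def figure_of_merit(samples):
    """Return 1 / sqrt(det Cov) for samples of shape (n_samples, 2)."""
    samples = np.asarray(samples, dtype=float)
    # rowvar=False: each column is one parameter (Omega_m, S_8).
    cov = np.cov(samples, rowvar=False)
    return 1.0 / np.sqrt(np.linalg.det(cov))


# Synthetic stand-in for posterior samples, centred on the fiducial cosmology.
rng = np.random.default_rng(0)
toy_cov = np.array([[0.02**2, 0.0], [0.0, 0.01**2]])
samples = rng.multivariate_normal([0.29, 0.81], toy_cov, size=20000)

fom = figure_of_merit(samples)
# Analytic value for this toy covariance is 1 / (0.02 * 0.01) = 5000;
# the sample estimate should land close to it.
print(f"FoM = {fom:.0f}")
```

Because the FoM depends on the determinant of a 2x2 covariance, halving the posterior width in both parameters multiplies the FoM by four.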

Scripts:

| Script | Description |
| --- | --- |
| cosmoford/models_nopatch.py | Compressor model, trained via cosmoford/trainer.py |
| scripts/run_npe_budget_scan.py | Trains the NPE flow and computes FoM, sweeping over simulation budgets |
| scripts/plot_fom_budget.py | Plots FoM vs. simulation budget from saved results |

Datasets:

| Dataset | Split | Used for |
| --- | --- | --- |
| CosmoStat/neurips-wl-challenge-flat | train / validation | Compressor training and validation |
| CosmoStat/neurips-wl-challenge-holdout | train | NPE training (summaries precomputed with noise augmentation) |
| CosmoStat/neurips-wl-challenge-holdout | fiducial | FoM evaluation |

⚠️ See below for how to access the datasets.


🗂️ Summary statistics strategies

🔢 Option A — Analytical summaries

Physically motivated statistics computed directly from the masked convergence maps, such as peak counts, wavelet ℓ₁-norm, or power spectrum. A small MLP is then trained to compress these hand-crafted features into an 8D vector by maximizing a Gaussian log-likelihood.
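To make one of the statistics named above concrete, here is a minimal sketch of peak counts: a histogram of the heights of local maxima in a convergence map. The function name `peak_counts`, the bin edges, and the toy map are assumptions for illustration, and mask handling is omitted; this is not the repository's implementation.

```python
import numpy as np


def peak_counts(kappa, bin_edges):
    """Histogram of pixels strictly above all 8 neighbours (interior only)."""
    kappa = np.asarray(kappa, dtype=float)
    n_rows, n_cols = kappa.shape
    centre = kappa[1:-1, 1:-1]
    is_peak = np.ones(centre.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Shifted view of the map aligned with the interior pixels.
            neighbour = kappa[1 + dy : n_rows - 1 + dy, 1 + dx : n_cols - 1 + dx]
            is_peak &= centre > neighbour
    counts, _ = np.histogram(centre[is_peak], bins=bin_edges)
    return counts


# Toy 7x7 map with two isolated peaks of height 1 and 3.
toy_map = np.zeros((7, 7))
toy_map[2, 2] = 1.0
toy_map[4, 4] = 3.0
counts = peak_counts(toy_map, bin_edges=[0.0, 2.0, 4.0])
print(counts)
```

The resulting count vector (possibly concatenated with other statistics) is the kind of hand-crafted feature that the small MLP would then compress to 8 dimensions.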

Training script: trainer fit -c <config TBD>

Dataset: CosmoStat/neurips-wl-challenge-flat


🧠 Option B — Neural compressor (no pre-training)

An EfficientNetV2-S network trained directly on the N-body simulations, compressing each convergence map to 8 summary statistics by maximizing the Gaussian log-likelihood.
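The README does not spell out how the Gaussian log-likelihood is parameterised. One common choice, assumed here purely for illustration, is that the model predicts a mean and a log-variance for each cosmological parameter and minimises the negative log-likelihood of a diagonal Gaussian. The sketch below uses NumPy so the arithmetic is easy to check; the name `gaussian_nll` is hypothetical.

```python
import numpy as np


def gaussian_nll(theta, mean, log_var):
    """Mean negative log-likelihood of theta under N(mean, diag(exp(log_var)))."""
    theta, mean, log_var = (
        np.asarray(a, dtype=float) for a in (theta, mean, log_var)
    )
    per_dim = 0.5 * (
        np.log(2.0 * np.pi) + log_var + (theta - mean) ** 2 / np.exp(log_var)
    )
    # Sum over parameters (Omega_m, S_8), average over the batch.
    return per_dim.sum(axis=-1).mean()


theta_true = [[0.29, 0.81]]
loss_good = gaussian_nll(theta_true, [[0.29, 0.81]], [[0.0, 0.0]])
loss_bad = gaussian_nll(theta_true, [[0.40, 0.70]], [[0.0, 0.0]])
print(loss_good, loss_bad)
```

Maximising the log-likelihood is the same as minimising this quantity; a prediction further from the true parameters yields a larger loss at fixed variance.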

Training script: trainer fit -c configs/experiments/efficientnet_v2_s_logp_.yaml

Dataset: CosmoStat/neurips-wl-challenge-flat


🚀 Option C — Neural compressor with pre-training

Same EfficientNetV2-S architecture, but first pre-trained on a larger set of cheaper simulations to reduce overfitting when the N-body budget is small, then fine-tuned on the N-body dataset. The compressor is trained with a Gaussian log-likelihood loss.

Fine-tuning script: trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml

Update pretrained_checkpoint_path in the config to point to your pre-trained checkpoint.

Fine-tuning dataset: CosmoStat/neurips-wl-challenge-flat

Available pre-training datasets and their configs:

| Simulation type | Local dataset | Pre-training config |
| --- | --- | --- |
| Gaussian Random Field (GRF) | CosmoStat/GRF_HF | None |
| LogNormal | CosmoStat/lognormal | configs/experiments/pretrain_lognormal_nopatch_logp.yaml |
| Gower Street | CosmoStat/gowerstreet-train | configs/experiments/pretrain_gowerstreet_nopatch_logp.yaml |
| OT-emulated (from LogNormal) | CosmoStat/ot_emulated | configs/pretrain_otemulated_nopatch_logp.yaml |
| OT-emulated from TBD | output of the emulator (see below) | TBD |

# Example: pre-train on LogNormal, then fine-tune on challenge data
trainer fit -c configs/experiments/pretrain_lognormal_nopatch_logp.yaml
trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml

⚙️ Building the OT-emulated dataset

To bridge the gap between cheap simulations and the N-body distribution, a UNet emulator is trained using conditional optimal-transport flow matching (COT-FM). It maps LogNormal (or Gower Street) convergence maps to the distribution of N-body maps, conditioned on cosmological parameters. The emulated maps are then used as pre-training data for Option C.
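The core idea of optimal-transport flow matching can be sketched in a few lines: pair each cheap map with a nearby N-body map inside a mini-batch, then regress a velocity field along the straight line between the two. The snippet below is a generic illustration under stated assumptions, not the code in cosmoford/emulator/cot_fm.py: the helper names are hypothetical, the conditioning on cosmological parameters is left out, and the pairing uses SciPy's `linear_sum_assignment` on squared distances.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def ot_pair(x0, x1):
    """Reorder the batch so pairs (x0[i], x1[i]) minimise total squared distance."""
    a = x0.reshape(len(x0), -1)
    b = x1.reshape(len(x1), -1)
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols]


def flow_matching_targets(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its velocity."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over map axes
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0  # constant along a straight path
    return x_t, velocity


# Toy 1-pixel "maps": cheap simulations (x0) and N-body targets (x1).
cheap = np.array([[0.0], [10.0]])
nbody = np.array([[11.0], [1.0]])
src, tgt = ot_pair(cheap, nbody)
x_t, v = flow_matching_targets(src, tgt, t=np.array([0.5, 0.5]))
print(tgt.ravel(), x_t.ravel(), v.ravel())
```

In a training loop, a network (here the UNet) would receive `x_t`, `t`, and the cosmological parameters, and be trained to predict `velocity` with a mean-squared-error loss. The dense cost matrix is only practical for small batches or flattened low-resolution maps.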

Training script: cosmoford/emulator/cot_fm.py

UNet configs: configs/unet_condition_small.yaml / configs/unet_condition_large.yaml

Build HF dataset from emulated maps: scripts/hf_emulated_dataset.py

| Dataset | Role |
| --- | --- |
| CosmoStat/GRF_HF | Cheap simulations to be corrected (GRF) |
| CosmoStat/lognormal | Cheap simulations to be corrected (LogNormal) |
| PM source | To be generated |
| CosmoStat/neurips-wl-challenge-flat | N-body target distribution for the emulator |

python cosmoford/emulator/cot_fm.py \
    --config_yaml configs/unet_condition_large.yaml \
    --dataset_dir_nbody <path/to/neurips-wl-challenge-flat> \
    --dataset_dir_logn_train <path/to/GRF_HF> \
    --num_epochs 100

# Build the emulated HF dataset
python scripts/hf_emulated_dataset.py

🔧 Installation

pip install -e .

Requires Python ≥ 3.8. Key dependencies: torch, lightning, diffusers, torchdyn, nflows, datasets, wandb.


📦 Dataset loading

By default, datasets are loaded locally from /project/rrg-lplevass/shared/wl_chall_data/ (on the Rorqual cluster). The expected directory structure is:

/project/rrg-lplevass/shared/wl_chall_data/
├── neurips-wl-challenge-flat/   # Main challenge dataset (train/validation splits)
├── lognormal/                   # LogNormal pretraining data
├── gowerstreet-train/           # Gower Street pretraining data
├── ot_emulated/                 # OT-emulated pretraining data
└── GRF_HF/                     # Gaussian Random Field pretraining data
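Before launching a run against a local copy, it can help to check that the folders listed above are actually present. The helper below is a hypothetical convenience, not part of the package; it only uses the folder names from the tree above.

```python
import tempfile
from pathlib import Path

# Folder names copied from the expected directory structure.
EXPECTED_SUBDIRS = [
    "neurips-wl-challenge-flat",
    "lognormal",
    "gowerstreet-train",
    "ot_emulated",
    "GRF_HF",
]


def missing_datasets(data_dir):
    """Return the expected dataset folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demo on a temporary directory that contains only the LogNormal folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lognormal").mkdir()
    missing = missing_datasets(tmp)

print(missing)
```

On the cluster, the same function could be called with the default path to confirm which pre-training datasets are available.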

To load from HuggingFace Hub / GCS instead (e.g. when running outside the cluster), set use_hub: true in your config:

data:
  init_args:
    use_hub: true

To use a different local directory, set data_dir:

data:
  init_args:
    data_dir: /path/to/your/datasets

All options can also be passed as CLI overrides:

# Default: just pick a dataset mode (loads locally from the default path)
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.dataset_mode=lognormal

# Load from HuggingFace Hub
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.use_hub=true

# Load from a custom local path
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.data_dir=/scratch/datasets

Available dataset_mode values: train, full, lognormal, gowerstreet, gowerstreet-train, ot_emulated, grf.


👥 Team

@AndreasTersenov @ASKabalan @b-remy
@EiffL @noe-dia @JuliaLinhart
@Justinezgh @LaurencePeanuts @SammyS15
@sachaguer @rouzib

📝 License

See LICENSE file for details.
