Accepted at the International Conference on Learning Representations, 2026.
Authors
Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Yasaman Jafari, Ruijia Niu, Manas Jain, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, and Rose Yu
Links
arXiv | ICLR 2026 Poster | Dataset
Zephyrus pairs an LLM with ZephyrusWorld, a weather-science execution environment that exposes WeatherBench 2 data, geolocation utilities, forecasting, simulation, and climatology tools through Python APIs. This repository contains the agent implementations, the code execution server, the benchmark/evaluation pipeline, and the task-generation code used in the paper.
Figure from the paper: Zephyrus writes code, executes it against weather tools and datasets, observes the result, and iterates before answering.
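The write-execute-observe-iterate cycle in the figure can be sketched as a short loop. This is an illustrative sketch only: the `llm` and `execute` callables and the `FINAL:` stop marker are hypothetical stand-ins, not the repository's actual interfaces.

```python
def reflective_loop(llm, execute, question, max_turns=5):
    """Sketch of a multi-turn execute-observe-refine loop.

    `llm` maps the transcript so far to a generated program; `execute`
    runs that program against the weather tools and returns the
    observation as text. Both callables and the "FINAL:" answer
    convention are assumptions for illustration.
    """
    transcript = [question]
    for _ in range(max_turns):
        code = llm("\n".join(transcript))     # generate a candidate program
        observation = execute(code)           # run it, observe the result
        transcript += [code, observation]
        if observation.startswith("FINAL:"):  # the agent decided to answer
            return observation[len("FINAL:"):].strip()
    return transcript[-1]  # fall back to the last observation
```

Zephyrus-Direct is the degenerate case of this loop with a single turn and no observation feedback.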
- **ZephyrusWorld** unifies WeatherBench 2 access, Natural Earth geolocation, a Stormer forecaster, a JAX-based simulator, and climatology queries.
- **Zephyrus-Direct** solves questions with one generated program; **Zephyrus-Reflective** uses a multi-turn execute-observe-refine loop.
- **ZephyrusBench** contains 2,230 question-answer pairs across 49 weather-science tasks.
- Zephyrus improves correctness over text-only baselines by up to 44.2 percentage points.
Percentage of benchmark questions answered correctly across LLM backbones and model variants.
The FastAPI-based execution server distributes requests across workers and pooled weather tools.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv --python 3.11
source .venv/bin/activate
uv sync --active

Download the Natural Earth vector data:

mkdir -p assets/NaturalEarth
cd assets/NaturalEarth
wget https://naciscdn.org/naturalearth/packages/natural_earth_vector.zip
unzip natural_earth_vector.zip
rm natural_earth_vector.zip
cd ../../

Download the WeatherBench 2 ERA5 zarr used by the project and point wb2_path in configs/paths/default.yaml to it. Note that this step requires gsutil to be installed.
#!/bin/bash
DATA_DIR="./data/" # update to the desired path
DATASET="1959-2023_01_10-6h-240x121_equiangular_with_poles_conservative.zarr"
subdirs1=(
    10m_u_component_of_wind
    10m_v_component_of_wind
    2m_temperature
    geopotential
    geopotential_at_surface
    land_sea_mask
    latitude
    level
    longitude
    mean_sea_level_pressure
    soil_type
    specific_humidity
    surface_pressure
    temperature
    time
    u_component_of_wind
    v_component_of_wind
    mean_top_downward_short_wave_radiation_flux
)
DATA_DIR="$DATA_DIR/$DATASET"
mkdir -p "$DATA_DIR"
cd "$DATA_DIR"
gsutil -m cp -n \
    "gs://weatherbench2/datasets/era5/$DATASET/.zattrs" \
    "gs://weatherbench2/datasets/era5/$DATASET/.zgroup" \
    "gs://weatherbench2/datasets/era5/$DATASET/.zmetadata" \
    .
for subdir in "${subdirs1[@]}"; do
    echo "Downloading $subdir"
    gsutil -m cp -r -n "gs://weatherbench2/datasets/era5/$DATASET/$subdir" .
done
echo "Downloaded data to $DATA_DIR"

WeatherBench 2 is large. Plan for roughly 550 GB of free disk space.
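Because gsutil cp -n skips files that already exist, the script above can be re-run to resume an interrupted download. A quick way to check progress is to total the store's on-disk size; this helper is a convenience sketch, not part of the repository:

```python
import os

def store_size_gb(path):
    """Total on-disk size of a directory tree (e.g. the zarr store), in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Expect roughly 550 GB once the full ERA5 store is present, e.g.:
# print(store_size_gb("./data/1959-2023_01_10-6h-240x121_equiangular_with_poles_conservative.zarr"))
```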
Install Stormer and fetch the checkpoint:
sh scripts/install_stormer.sh

To download the cache, review and adapt scripts/download_cache.sh before running it so its output directories match your local config, then run:
sh scripts/download_cache.sh

Before running the code, update these configs for your machine:
- configs/paths/default.yaml: wb2_path, natural_earth_path, climatology_cache_dir, model_output_dir, model_output_cache_dir
- configs/model/agent.yaml, configs/model/pal.yaml, configs/model/api_text_only_llm.yaml: base_url, api_llm_model
- configs/server.yaml: gpu_pool, num_workers, port
- configs/eval/evaluator.yaml: evaluator backend/model if you are not using the default OpenAI endpoint
For OpenAI-compatible backends:
export OPENAI_API_KEY="<your-key>"

If your local or self-hosted OpenAI-compatible endpoint ignores auth, a dummy value is enough:
export OPENAI_API_KEY="none"

For Gemini-based configs:
export GOOGLE_API_KEY="<your-key>"

Download the ZephyrusBench dataset from this link, and update the dataset_path variable in configs/paths/default.yaml.
Make sure to move the simulation_outputs folder to cache/simulation_outputs.
python -m src.code_execution.server

The server uses configs/server.yaml and defaults to port 8000.
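For a quick smoke test you can POST code to the running server from Python. The /execute route and the request field names below are assumptions for illustration; check src/code_execution/server.py and configs/server.yaml for the actual schema and port.

```python
import json

def build_execution_request(code, timeout_s=60):
    """Serialize a code-execution request body as JSON bytes.

    The "code" and "timeout" field names are illustrative guesses --
    consult the server implementation for the real request schema.
    """
    return json.dumps({"code": code, "timeout": timeout_s}).encode("utf-8")

# Sending it to the server (default port 8000; "/execute" is a hypothetical route):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/execute",
#     data=build_execution_request("print(1 + 1)"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```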
The benchmark loader expects a directory containing one or more .json files. The benchmark can be downloaded from the Hugging Face link.
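A minimal loader matching that layout might look like the following. It assumes each .json file holds a JSON list of question records; check the released ZephyrusBench files for the actual schema.

```python
import json
from pathlib import Path

def load_benchmark(benchmark_dir):
    """Concatenate the records from every .json file in a benchmark directory.

    Assumes each file contains a JSON list of question records (an
    assumption -- inspect the released files for the real structure).
    """
    questions = []
    for path in sorted(Path(benchmark_dir).glob("*.json")):
        questions.extend(json.loads(path.read_text()))
    return questions
```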
Run the reflective agent:
python -m src.run model=agent

Run the single-shot PAL baseline (Zephyrus-Direct):
python -m src.run model=pal

Run the text-only baseline:
python -m src.run model=api_text_only_llm

To resume from cached predictions:
python -m src.run model=agent resume=true

api_text_only_llm does not need the execution server; agent and pal do.
python -m src.evaluate

Evaluation scans model_outputs/, writes per-run _processed.json files, and produces summary logs. The evaluator also uses an LLM backend for answer extraction/verification.
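If you want to post-process the per-run files yourself, a tiny summarizer could look like this. The "correct" field name is an assumption for illustration; inspect the _processed.json files the evaluator actually writes for the key it uses.

```python
import json
from pathlib import Path

def summarize_run(processed_path):
    """Fraction of questions judged correct in one _processed.json file.

    Assumes the file is a JSON list of records with a boolean "correct"
    field -- a guess at the schema, not the evaluator's documented format.
    """
    records = json.loads(Path(processed_path).read_text())
    return sum(bool(r.get("correct")) for r in records) / len(records)
```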
Example vLLM server launch:
CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
    --model /data/qwen_models/qwen3-coder-30b \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --port 8008

To plug a local model into Zephyrus:
- Point llm_client.base_url to http://localhost:8008/v1.
- Set llm_client.api_llm_model to the model name exposed by vLLM.
- Export OPENAI_API_KEY="none" if your local endpoint does not enforce auth.
- Start the Zephyrus code execution server in another terminal if you are using agent or pal.
- Run inference normally with python -m src.run model=....
This uses the same OpenAI-compatible client path as hosted endpoints, so no separate local-model code path is required.
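Concretely, any client that speaks the OpenAI chat-completions protocol will work against the vLLM server launched above. The helper below builds a standard /v1/chat/completions request body with the stdlib only; the model name passed in should match whatever name vLLM serves (by default, the --model path).

```python
import json

def chat_request_body(model, prompt):
    """Build a standard OpenAI-style /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

# POSTing to the local vLLM server (port 8008 as launched above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8008/v1/chat/completions",
#     data=chat_request_body("/data/qwen_models/qwen3-coder-30b", "Hello"),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer none"},
# )
# print(urllib.request.urlopen(req).read().decode())
```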
If you want to regenerate task data instead of using the released dataset, first edit configs/data/default.yaml. Then run:
python -m src.generate

Generated outputs are written under the configured save directories in configs/paths/default.yaml.
Example output:
Generating prompt for prompt id: HIuSnl question id: dvzBtE
The following data shows the global data over a period of 42 hours, sampled at an interval of 6 hours. {'variables': ['2m_temperature'], 'time_indices': [496474, 496480, 496486, 496492, 496498, 496504, 496510]} Based on the above data, answer the following question: What is the median 2m_temperature in Rîşcani,MD? Based on the provided data, the median 2m_temperature at Rîşcani,MD is 292.8622131347656.
Generating prompt for prompt id: HIuSnl question id: QaajpN
The following data shows the global data over a period of 42 hours, sampled at an interval of 6 hours. {'variables': ['2m_temperature'], 'time_indices': [523641, 523647, 523653, 523659, 523665, 523671, 523677]} Based on the above data, answer the following question: Which continent experienced the lowest 2m_temperature? Based on the provided data, Antarctica experienced the lowest 2m_temperature over the specified time-period, with the lowest 2m_temperature of 205.49594116210938.
If you found our work useful, please consider citing our paper.
@inproceedings{
varambally2026zephyrus,
title={Zephyrus: An Agentic Framework for Weather Science},
author={Sumanth Varambally and Marshall Fisher and Jas Thakker and Yiwei Chen and Zhirui Xia and Yasaman Jafari and Ruijia Niu and Manas Jain and Veeramakali Vignesh Manivannan and Zachary Novack and Luyu Han and Srikar Eranky and Salva R{\"u}hling Cachay and Taylor Berg-Kirkpatrick and Duncan Watson-Parris and Yian Ma and Rose Yu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=aVeaNahsID}
}

