Currently, we use separate rules and scripts to evaluate baseline forecasts and ML forecast runs. This creates redundancy and increases the risk of inconsistencies, because a large fraction of the reading and verification logic is duplicated across the scripts.
Instead, we should refactor the code to accept either GRIB (ML forecast) or Zarr (baseline) forecast inputs, and consolidate the existing rules.
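
One possible shape for the shared reading logic is a single entry point that dispatches on the input format. The following is only a rough sketch under assumptions: the names `READERS`, `register_reader`, `detect_format` and `load_forecast` are hypothetical, the format is guessed from the path suffix (the actual convention in the repository may differ), and the readers assume `xarray` with the `cfgrib` backend is available.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: format name -> reader function.
READERS: Dict[str, Callable[[Path], object]] = {}


def register_reader(fmt: str):
    """Decorator that registers a reader function for one input format."""
    def decorator(func):
        READERS[fmt] = func
        return func
    return decorator


def detect_format(path: Path) -> str:
    """Guess the forecast format from the path suffix (assumed convention)."""
    suffix = path.suffix.lower()
    if suffix in {".grib", ".grib2", ".grb"}:
        return "grib"
    if suffix == ".zarr":
        return "zarr"
    raise ValueError(f"Unsupported forecast format: {path}")


@register_reader("grib")
def _read_grib(path: Path):
    # Imported lazily so the dispatch logic works without xarray installed.
    import xarray as xr
    return xr.open_dataset(path, engine="cfgrib")


@register_reader("zarr")
def _read_zarr(path: Path):
    import xarray as xr
    return xr.open_zarr(path)


def load_forecast(path) -> object:
    """Single entry point for both ML (GRIB) and baseline (Zarr) forecasts."""
    path = Path(path)
    return READERS[detect_format(path)](path)
```

With something like this in place, a single consolidated rule could call `load_forecast` for both forecast types, so the verification code downstream would only need to exist once.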