Skip to content

Add run_fcre_aed_forecast entry function for FaaSr-deployable GLM-AED forecast#92

Open
Ashish-Ramrakhiani wants to merge 45 commits into
FLARE-forecast:mainfrom
Ashish-Ramrakhiani:flare_faasr_int
Open

Add run_fcre_aed_forecast entry function for FaaSr-deployable GLM-AED forecast#92
Ashish-Ramrakhiani wants to merge 45 commits into
FLARE-forecast:mainfrom
Ashish-Ramrakhiani:flare_faasr_int

Conversation

@Ashish-Ramrakhiani

Copy link
Copy Markdown

Summary

Adds a single-action entry function (workflows/glm_aed_flare_rs/run_fcre_aed_forecast.R)
that wraps the FCRE GLM-AED forecast loop in a form deployable as a FaaSr
action. The existing standalone driver (combined_run_aed.R) is preserved
unchanged and continues to work.

Changes

  • New workflows/glm_aed_flare_rs/run_fcre_aed_forecast.R — callable
    entry function with embedded forecast loop, soft time-budget for
    long-running-action targets, and config refresh after update_run_config.
  • Refactor R/convert_vera4cast_inflow.R, R/generate_forecast_score_arrow.R,
    and workflows/glm_aed_flare_rs/generate_inflow_forecast.R — accept an
    optional config argument; when supplied, route S3 reads/writes through
    the new FLAREr::flare_* dispatch layer. When config = NULL (the
    existing standalone-driver call pattern), behavior is unchanged.
  • Config bundle updates for FLAREr v3.1-dev-netcdf-v2 compatibility:
    • configure_run.yml: adds use_faasr flag
    • configure_flare_glm_aed.yml: base_AED_phyto_pars_nml_file,
      non_vertical_noise_config_file, restart_save_timesteps: "all";
      obs_filename switched to insitu
    • glm3.nml: init_restart_from_file, init_restart_fname,
      restart_fname placeholders
    • observations_config_aed.csv: model_source / model_variable /
      model_depth_m columns; secchi row dropped
    • non_vertical_noise_config.csv: new file (verbatim from upstream
      FLAREr v3.1-dev-netcdf-v2)

FaaSr_py uses GitHub tarball API for plain owner/repo entries, which
extracts to /tmp/functions/{InvocationID}/{Owner}-{Repo}-{sha}/, not
/{Owner}/{Repo}/. The previous glob expected the clone-style path and
fell through to here::here() = /action.
… sim_name

lake_directory discovery: replace the fork-username-specific glob
with a marker-based walk of /tmp/functions for any directory
containing configuration/<config_set>/<configure_run_file>. Works for
any fork without code changes; portable across the FaaSr clone vs
tarball extraction layouts.

configure_run.yml (deployment-only on flare_faasr_int): pin
forecast_start_datetime to 2026-05-01 (verified both noaa stage2 and
vera4cast inflow_gefsClimAED have data for that date) and rename
sim_name to glm_aed_flare_rs_faasr_test so all S3 writes
(restart/forecasts/scores) land in a sandbox path distinct from the
operational glm_aed_flare_v3 model_id.
… clobber

When sourced from run_fcre_aed_forecast.R, the unconditional
lake_directory <- here::here() / config_set_name <- 'glm_aed_flare_v3'
re-assignments overwrote the caller's already-resolved values, causing
the subsequent set_up_simulation to look in /action/configuration/
glm_aed_flare_v3/configure_run.yml and fail.

Wrap each top-level var in 'if (!exists(...))'. Console / RStudio users
still get the original defaults (no parent values present); programmatic
callers can pre-set lake_directory, config_set_name, configure_run_file,
config to avoid the clobber. Backward-compatible.
…lenv

faasr_source_r_files at action startup recursively sources every .R
file in the FCRE tarball. Many legacy scripts assign top-level
lake_directory <- here::here() and config_set_name <- 'ltreb' (etc.)
into globalenv. Those values persist.

source() defaults to local=FALSE which evaluates in globalenv. So
when run_fcre_aed_forecast did source(generate_inflow_forecast.R),
the sourced script's exists('lake_directory') check found the stale
globalenv value and skipped re-assignment, keeping
lake_directory='/action' and config_set_name='ltreb'. The subsequent
set_up_simulation tried '/action/configuration/ltreb/configure_aed_run.yml'
and failed.

Pass local=TRUE to all four source() calls so they evaluate in the
entry function's local environment, where lake_directory and
config_set_name are correctly set.
rstac's STAC datetime parameter requires RFC 3339 format
('YYYY-MM-DDTHH:MM:SSZ'). YAML loads start_datetime/forecast_start_datetime
as POSIXct or as space-separated strings, both of which produce
'YYYY-MM-DD HH:MM:SS' on paste() — rstac rejects with 'date time
provided not follow the RFC 3339 format'. Format with strftime to ISO
8601 / RFC 3339 before calling get_lst.

Wrap get_lst and get_vals in tryCatch so any failure (rstac error,
network blip, no images for the date range) falls cleanly into the
existing 'No new RS data' placeholder branch instead of crashing the
whole forecast iteration. Replaces the brittle exists('vals') check
with an explicit !is.null(vals).
Stage3 met data has duplicate (datetime, parameter, variable) entries
on 2025-07-22, 2025-07-24, 2025-07-25 (different prediction values
for the same key). With the prior 2026-05-01 pin and start_datetime
2024-10-01, the 19-month spinup window pulled these duplicates into
create_met_files's pivot_wider, producing list-columns and crashing
mutate(WindSpeed = sqrt(eastward_wind^2 + northward_wind^2)) with
"non-numeric argument to binary operator".

The 2024-10-01 -> 2024-12-20 window predates the duplicate zone, has
both noaa stage2 and vera4cast inflow data available, and runs ~7x
faster on spinup. The stage3 duplicate-data issue is an upstream
data-quality problem out of scope for this work.
Cuts the forward GLM-AED simulation from 34 days x 221 ensembles to
1 day x 221 ensembles. The spinup window
(start_datetime -> forecast_start_datetime) is unchanged at 80 days;
this only shrinks the forward forecast step. Deployment-only on
flare_faasr_int; the operational YAML keeps forecast_horizon: 34.
…ility

FLAREr v3.1-dev's generate_initial_conditions.R:128 does
init$inflation[] <- config$da_setup$inflation_factor; without the
field on the RHS, R errors with "replacement has length zero". The
field was introduced in v3.1-dev's param-fitting work and doesn't
exist in main, which is why production pipelines running off main
don't hit it.

1.0 = no inflation, matching every bundled extdata config in FLAREr
itself (default/, aed/, default_pf/, open_meteo/). Harmless on FLAREr
main, required on v3.1-dev.
v3.1-dev's run_da_forecast.R branches on
config$da_setup$add_random_noise (values 0/1/2) at lines 386, 402,
450, with no default. When the field is missing from the YAML, the
if() expression evaluates to a length-zero NULL and R errors with
"argument is of length zero".

FLAREr main unconditionally calls add_process_noise() per ensemble at
run_da_forecast.R:374 -- equivalent to v3.1-dev's add_random_noise == 2
branch. Setting 2 reproduces main's behaviour. Harmless on FLAREr
main (field ignored), required on v3.1-dev.
v3.1-dev's run_flare.R:145 reads config$model_settings$depth_sd into
states_non_vertical$depth_sd. Then run_da_forecast.R:388 / :446 use it
as the sd argument to rnorm() to perturb lake_depth per ensemble per
timestep. Without the YAML field, sd=NULL and rnorm errors with
'invalid arguments'.

FLAREr main has no states_non_vertical concept and no depth perturbation
in DA — depth_sd: 0.0 reproduces that behavior (rnorm(n, mean, sd=0)
returns mean exactly, no randomness). Harmless on main, required on
v3.1-dev.
start_datetime: 2024-12-17 (was 2024-10-01) -- DA loop runs 3 daily
iterations instead of 80, exercising the same code path with ~25x
less spinup compute.

ensemble_size: 31 (was 221) -- each DA / forecast step runs 31
GLM-AED simulations instead of 221, ~7x less per-step compute.

Combined: ~175x compute reduction for the GLM-AED-heavy DA loop.
Deployment-only change on flare_faasr_int; the operational YAML in
the upstream branch keeps start_datetime: 2024-10-01 and
ensemble_size: 221. The test only verifies pipeline wiring, not
forecast quality -- a 3-day spinup is not enough for AED state
convergence.
…_aed_flare_v3

Three filter sites in run_fcre_aed_forecast.R hardcoded
model_id == 'glm_aed_flare_v3' (carried over from combined_run_aed.R).
Two consequences when sim_name != 'glm_aed_flare_v3':

1. Reads the production forecast instead of the one we just wrote --
   the pipeline doesn't actually exercise its own output.
2. Score parquet auto-partitions by the model_id column from the
   filtered data, so writes land at
   .../scores/.../model_id=glm_aed_flare_v3/... -- additive pollution
   of the production namespace.

Replace the literal with config$run_config$sim_name at all three
sites. The script now reads the forecast it just produced, scores it,
and writes scores under its own sim_name; generic for any sim_name.
…the run

Submit can fail for transient reasons (vera4cast API down, rate limit,
schema validation) that aren't related to whether our forecast pipeline
worked. Catch and log instead of crashing — restart/forecast/score
parquet writes already succeeded at this point, the run should reach
completion regardless.

AWS env restore now runs after either branch (submit success or caught
error).
…config

Two fixes:

1. arrow::open_dataset partition listing was taking ~65min each over OSN
   because the prefix was the bucket root (flare/forecasts/parquet) and
   arrow had to enumerate every model_id x reference_date directory under
   the entire forecasts bucket to build the schema, even though the dplyr
   filter only wanted one partition. Push the partition keys into the
   prefix so arrow opens exactly one parquet file. The dplyr filter is
   now redundant and removed. Two sites (vera4cast aggregation block and
   score block, plus the past_forecasts variant).

2. update_run_config in v3.1-dev FLAREr has a config parameter not
   present in main. Internally it calls flare_put_file(config = config).
   Our caller wasn't passing config, so the inner call errored with
   'argument config is missing'. Add config = config to the call.
… still resolve

Side effect of pointing arrow::open_dataset directly at a leaf
partition path (site_id=X/model_id=Y/reference_date=Z): partition
columns are embedded in the path, not in the parquet file. arrow only
synthesizes them when discovering partitions from the bucket root. At
the leaf path, the resulting tibble has only the parquet's own columns
- no site_id, model_id, reference_date.

Downstream dplyr summarize(.by = c(..., site_id, model_id, ...)) and
write_dataset(partitioning = c('site_id','model_id','reference_date'))
both need those columns. Add them back via mutate() with the values we
already have (the same values we used to construct the prefix).

Three sites: vera4cast aggregation forecast_df, score block forecast_df,
score block past_forecasts.
… testing

Force get_run_config to ignore S3-leftover configure_run.yml and use
local YAML on every invoke. Without this, each test run inherits the
prior run's advanced forecast_start_datetime + restart_file pointer
and tries to resume -- which exposes latent bugs in loop logic that
don't manifest in a daily-cron production setting where each invoke
runs a single iteration.

Revert before merging to main: production runs rely on
clean_start=FALSE (default) so the daily restart chain works.
Reverting the deployment-only override added in 045470e. Test
repeatability will be handled by clearing the test sandbox's S3 state
before each invoke instead of skipping the S3-resume branch in code.
The set_up_simulation call returns to its default (clean_start=FALSE).
forecast_start_datetime: 2026-05-04 -- latest date with both noaa
stage2 and vera4cast inflow_gefsClimAED data available. After iter 1
update_run_config advances to 2026-05-05; no data published for that
date yet, so noaa_ready/inflow_ready flip FALSE and the while loop
exits naturally. Mirrors the daily-cron production pattern where each
cron firing runs exactly one iteration.

start_datetime: 2026-05-01 -- 3-day spinup, well clear of the
2025-07-22 to 2025-07-25 stage3 met-data duplicate zone.
After update_run_config writes the advanced forecast_start_datetime to
disk + S3, the outer-scope config in run_fcre_aed_forecast still holds
the previous values. Sourced helpers on the next iteration
(generate_inflow_forecast.R -> convert_vera4cast_inflow) read the stale
value and write inflow partitions under the previous reference_date,
while run_flare re-reads config from disk and looks under the new one
-- producing arrow IOError 'No such file or directory' on the
reference_date partition.

Refreshing config via set_up_simulation right after update_run_config
keeps the in-memory view aligned with on-disk state. combined_run_aed.R
doesn't trip this because each cron firing runs the loop exactly once
per process; this script runs multiple iterations in one process.
Iter 1 produces a restart NetCDF whose time vector spans
[start_datetime, forecast_start_datetime + horizon]. After iter 1, the
combined_run_aed.R-style carry-over math sets the next iteration's
start_datetime to old_forecast_start_datetime - 4 days. With a 3-day
spinup that target lands BEFORE the NetCDF's earliest time, so
which(datetime == new_sd) returns integer(0) and
generate_initial_conditions crashes with 'replacement has length zero'
when indexing pars_restart with the empty restart_index.

Bumping start_datetime from 2026-05-01 to 2026-04-30 gives a 4-day
spinup; iter 1's NetCDF then includes 2026-04-30, so iter 2 finds its
restart_index. Production daily-cron config doesn't trip this because
its 80-day spinup leaves plenty of cushion.
…leanly

FLAREr::check_noaa_present catches arrow IOError on a missing NOAA
partition with a tryCatch whose handler calls message(error_message),
passing the simpleError condition object. message() with a single
condition argument signals the condition as-is, preserving its error
class -- so the FaaSr executor's outer withCallingHandlers(error=)
catches it and marks the action as failed instead of letting the
function return FALSE.

The team's daily-cron pattern never trips this because their config
runs against past dates where stage2 NOAA is always present. Our
chained-iteration pattern eventually advances past the published
horizon and hits the catch arm.

Wrap the two call sites with our own tryCatch that converts any error
into noaa_ready=FALSE so the while-loop exits cleanly. Remove this
workaround when upstream check_noaa_present is fixed to use
message(conditionMessage(e)).
Only set GLM_PATH='GLM3r' when unset, so a container ENV (or any
explicit pre-set value) wins. Pass local_path to flare_arrow_s3_bucket
at the two forecasts_parquet read sites so the read location matches
where write_forecast writes in local mode.
FLAREr's run_flare on v3.1-dev-netcdf-v2 reads
config$model_settings$base_AED_phyto_pars_nml_file (with _file suffix)
and passes it to update_phy_states_obs_mapping, which calls read_nml
on it. Point the new field at aed2.nml, which contains the
aed_phytoplankton block the function expects.
Ashish-Ramrakhiani and others added 15 commits May 14, 2026 23:08
FLAREr's run_model on v3.1-dev-netcdf-v2 sets init_restart_from_file
via glmtools::set_nml() before each GLM run; the call requires the
variable to exist in the base nml. Adds the placeholder (=0) plus
init_restart_fname to match the bundled aed/default templates.
Last NML field FLAREr's run_model writes via glmtools::set_nml that
wasn't present in glm3.nml. Mirrors the bundled aed/default templates.
Previous defensive default could land on GLM3r in the FaaSr container,
which doesn't have GLM3r installed. Check for /opt/glm/glm first; that's
where the FaaSr container drops the embedded GLM binary. Log the
resolved value at function entry to make the chain transparent.
fcre-targets-rs.csv is the remote-sensing (LST) target file — usually
empty because cloud-free imagery is sparse. With it as obs_filename,
create_obs_matrix returned all-NA observations, obs_count was 0 at
every iter, and FLAREr's no-DA branch fired — which has a separate
uninitialized-pars_corr bug. Using fcre-targets-insitu.csv (the rich
daily-depth observations) matches Quinn's bundled aed config and
keeps every iter on the DA branch.
It was the only non-vertical entry (multi_depth=0). v3.1-dev-netcdf-v2
adds a new code path for non-vertical obs that requires model_source /
model_variable / model_depth_m columns plus a non_vertical_noise_config
csv — neither of which our config has. Without those, extract_modeled
errors with 'missing value where TRUE/FALSE needed'. Removing secchi
sidesteps the path entirely and matches what the rest of the FaaSr
test workflow exercises.
Three files brought in from FLARE-forecast/FLAREr@v3.1-dev-netcdf-v2's
inst/extdata/configuration/aed/ (per Quinn's reference):

- non_vertical_noise_config.csv (new file, verbatim from upstream)
- non_vertical_noise_config_file key added to configure_flare_glm_aed.yml
  pointing at the new file
- observations_config_aed.csv schema extended with model_source,
  model_variable, model_depth_m columns. Existing multi_depth=1 rows
  get NA in the new columns (no behavior change). secchi row restored
  using upstream's diagnostic/extc_coeff/1.0 pattern but keeping FCRE's
  Secchi_m_sample target name.
write_restart on v3.1-dev-netcdf-v2 returns .zip when GLM restart
staging produces extra files (the default for this branch) and .nc
otherwise. Hard-coding .nc in the next iteration's restart_file
caused get_restart_file to 404 on iter 2 because iter 1 had uploaded
the file as .zip. Capturing run_flare's return value and using
basename() on whatever it actually wrote keeps the chain consistent
regardless of which format the underlying write produced.
v3.1-dev-netcdf-v2's write_restart only saves model state for the
timesteps listed in output_settings$restart_save_timesteps (default:
just forecast_start_datetime). The next iter computes start_datetime =
prev_fsd - 4 days, which is not in the saved set, so generate_initial_
conditions can't find the right offset and aborts.

'all' tells write_restart to save every timestep; cheap (5 small
snapshots in the zip) and removes the offset-coupling between the
loop's start_datetime math and the restart save list.
…lity

Quinn's new write_restart on v3.1-dev-netcdf-v2 only persists GLM
output timesteps in the zip — spinup_start (which is the date the team's
'-4 days' pattern lands on) is never saved. generate_restart_initial_
conditions_from_zip then can't find that date and aborts with
'No GLM restart directory found for date X. Available dates: ...'.

Shift to -3 days so the next iter's start_datetime lands on the
earliest GLM-output date in the restart zip while still preserving a
4-day spinup window into the new forecast_start_datetime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant