Add run_fcre_aed_forecast entry function for FaaSr-deployable GLM-AED forecast#92
Open
Ashish-Ramrakhiani wants to merge 45 commits into
Open
Add run_fcre_aed_forecast entry function for FaaSr-deployable GLM-AED forecast#92Ashish-Ramrakhiani wants to merge 45 commits into
Ashish-Ramrakhiani wants to merge 45 commits into
Conversation
FaaSr_py uses GitHub tarball API for plain owner/repo entries, which
extracts to /tmp/functions/{InvocationID}/{Owner}-{Repo}-{sha}/, not
/{Owner}/{Repo}/. The previous glob expected the clone-style path and
fell through to here::here() = /action.
… sim_name lake_directory discovery: replace the fork-username-specific glob with a marker-based walk of /tmp/functions for any directory containing configuration/<config_set>/<configure_run_file>. Works for any fork without code changes; portable across the FaaSr clone vs tarball extraction layouts. configure_run.yml (deployment-only on flare_faasr_int): pin forecast_start_datetime to 2026-05-01 (verified both noaa stage2 and vera4cast inflow_gefsClimAED have data for that date) and rename sim_name to glm_aed_flare_rs_faasr_test so all S3 writes (restart/forecasts/scores) land in a sandbox path distinct from the operational glm_aed_flare_v3 model_id.
… clobber When sourced from run_fcre_aed_forecast.R, the unconditional lake_directory <- here::here() / config_set_name <- 'glm_aed_flare_v3' re-assignments overwrote the caller's already-resolved values, causing the subsequent set_up_simulation to look in /action/configuration/ glm_aed_flare_v3/configure_run.yml and fail. Wrap each top-level var in 'if (!exists(...))'. Console / RStudio users still get the original defaults (no parent values present); programmatic callers can pre-set lake_directory, config_set_name, configure_run_file, config to avoid the clobber. Backward-compatible.
…lenv
faasr_source_r_files at action startup recursively sources every .R
file in the FCRE tarball. Many legacy scripts assign top-level
lake_directory <- here::here() and config_set_name <- 'ltreb' (etc.)
into globalenv. Those values persist.
source() defaults to local=FALSE which evaluates in globalenv. So
when run_fcre_aed_forecast did source(generate_inflow_forecast.R),
the sourced script's exists('lake_directory') check found the stale
globalenv value and skipped re-assignment, keeping
lake_directory='/action' and config_set_name='ltreb'. The subsequent
set_up_simulation tried '/action/configuration/ltreb/configure_aed_run.yml'
and failed.
Pass local=TRUE to all four source() calls so they evaluate in the
entry function's local environment, where lake_directory and
config_set_name are correctly set.
rstac's STAC datetime parameter requires RFC 3339 format
('YYYY-MM-DDTHH:MM:SSZ'). YAML loads start_datetime/forecast_start_datetime
as POSIXct or as space-separated strings, both of which produce
'YYYY-MM-DD HH:MM:SS' on paste() — rstac rejects with 'date time
provided not follow the RFC 3339 format'. Format with strftime to ISO
8601 / RFC 3339 before calling get_lst.
Wrap get_lst and get_vals in tryCatch so any failure (rstac error,
network blip, no images for the date range) falls cleanly into the
existing 'No new RS data' placeholder branch instead of crashing the
whole forecast iteration. Replaces the brittle exists('vals') check
with an explicit !is.null(vals).
Stage3 met data has duplicate (datetime, parameter, variable) entries on 2025-07-22, 2025-07-24, 2025-07-25 (different prediction values for the same key). With the prior 2026-05-01 pin and start_datetime 2024-10-01, the 19-month spinup window pulled these duplicates into create_met_files's pivot_wider, producing list-columns and crashing mutate(WindSpeed = sqrt(eastward_wind^2 + northward_wind^2)) with "non-numeric argument to binary operator". The 2024-10-01 -> 2024-12-20 window predates the duplicate zone, has both noaa stage2 and vera4cast inflow data available, and runs ~7x faster on spinup. The stage3 duplicate-data issue is an upstream data-quality problem out of scope for this work.
Cuts the forward GLM-AED simulation from 34 days x 221 ensembles to 1 day x 221 ensembles. The spinup window (start_datetime -> forecast_start_datetime) is unchanged at 80 days; this only shrinks the forward forecast step. Deployment-only on flare_faasr_int; the operational YAML keeps forecast_horizon: 34.
…ility FLAREr v3.1-dev's generate_initial_conditions.R:128 does init$inflation[] <- config$da_setup$inflation_factor; without the field on the RHS, R errors with "replacement has length zero". The field was introduced in v3.1-dev's param-fitting work and doesn't exist in main, which is why production pipelines running off main don't hit it. 1.0 = no inflation, matching every bundled extdata config in FLAREr itself (default/, aed/, default_pf/, open_meteo/). Harmless on FLAREr main, required on v3.1-dev.
v3.1-dev's run_da_forecast.R branches on config$da_setup$add_random_noise (values 0/1/2) at lines 386, 402, 450, with no default. When the field is missing from the YAML, the if() expression evaluates to a length-zero NULL and R errors with "argument is of length zero". FLAREr main unconditionally calls add_process_noise() per ensemble at run_da_forecast.R:374 -- equivalent to v3.1-dev's add_random_noise == 2 branch. Setting 2 reproduces main's behaviour. Harmless on FLAREr main (field ignored), required on v3.1-dev.
v3.1-dev's run_flare.R:145 reads config$model_settings$depth_sd into states_non_vertical$depth_sd. Then run_da_forecast.R:388 / :446 use it as the sd argument to rnorm() to perturb lake_depth per ensemble per timestep. Without the YAML field, sd=NULL and rnorm errors with 'invalid arguments'. FLAREr main has no states_non_vertical concept and no depth perturbation in DA — depth_sd: 0.0 reproduces that behavior (rnorm(n, mean, sd=0) returns mean exactly, no randomness). Harmless on main, required on v3.1-dev.
start_datetime: 2024-12-17 (was 2024-10-01) -- DA loop runs 3 daily iterations instead of 80, exercising the same code path with ~25x less spinup compute. ensemble_size: 31 (was 221) -- each DA / forecast step runs 31 GLM-AED simulations instead of 221, ~7x less per-step compute. Combined: ~175x compute reduction for the GLM-AED-heavy DA loop. Deployment-only change on flare_faasr_int; the operational YAML in the upstream branch keeps start_datetime: 2024-10-01 and ensemble_size: 221. The test only verifies pipeline wiring, not forecast quality -- a 3-day spinup is not enough for AED state convergence.
…_aed_flare_v3 Three filter sites in run_fcre_aed_forecast.R hardcoded model_id == 'glm_aed_flare_v3' (carried over from combined_run_aed.R). Two consequences when sim_name != 'glm_aed_flare_v3': 1. Reads the production forecast instead of the one we just wrote -- the pipeline doesn't actually exercise its own output. 2. Score parquet auto-partitions by the model_id column from the filtered data, so writes land at .../scores/.../model_id=glm_aed_flare_v3/... -- additive pollution of the production namespace. Replace the literal with config$run_config$sim_name at all three sites. The script now reads the forecast it just produced, scores it, and writes scores under its own sim_name; generic for any sim_name.
…the run Submit can fail for transient reasons (vera4cast API down, rate limit, schema validation) that aren't related to whether our forecast pipeline worked. Catch and log instead of crashing — restart/forecast/score parquet writes already succeeded at this point, the run should reach completion regardless. AWS env restore now runs after either branch (submit success or caught error).
…config Two fixes: 1. arrow::open_dataset partition listing was taking ~65min each over OSN because the prefix was the bucket root (flare/forecasts/parquet) and arrow had to enumerate every model_id x reference_date directory under the entire forecasts bucket to build the schema, even though the dplyr filter only wanted one partition. Push the partition keys into the prefix so arrow opens exactly one parquet file. The dplyr filter is now redundant and removed. Two sites (vera4cast aggregation block and score block, plus the past_forecasts variant). 2. update_run_config in v3.1-dev FLAREr has a config parameter not present in main. Internally it calls flare_put_file(config = config). Our caller wasn't passing config, so the inner call errored with 'argument config is missing'. Add config = config to the call.
… still resolve
Side effect of pointing arrow::open_dataset directly at a leaf
partition path (site_id=X/model_id=Y/reference_date=Z): partition
columns are embedded in the path, not in the parquet file. arrow only
synthesizes them when discovering partitions from the bucket root. At
the leaf path, the resulting tibble has only the parquet's own columns
- no site_id, model_id, reference_date.
Downstream dplyr summarize(.by = c(..., site_id, model_id, ...)) and
write_dataset(partitioning = c('site_id','model_id','reference_date'))
both need those columns. Add them back via mutate() with the values we
already have (the same values we used to construct the prefix).
Three sites: vera4cast aggregation forecast_df, score block forecast_df,
score block past_forecasts.
… testing Force get_run_config to ignore S3-leftover configure_run.yml and use local YAML on every invoke. Without this, each test run inherits the prior run's advanced forecast_start_datetime + restart_file pointer and tries to resume -- which exposes latent bugs in loop logic that don't manifest in a daily-cron production setting where each invoke runs a single iteration. Revert before merging to main: production runs rely on clean_start=FALSE (default) so the daily restart chain works.
Reverting the deployment-only override added in 045470e. Test repeatability will be handled by clearing the test sandbox's S3 state before each invoke instead of skipping the S3-resume branch in code. The set_up_simulation call returns to its default (clean_start=FALSE).
forecast_start_datetime: 2026-05-04 -- latest date with both noaa stage2 and vera4cast inflow_gefsClimAED data available. After iter 1 update_run_config advances to 2026-05-05; no data published for that date yet, so noaa_ready/inflow_ready flip FALSE and the while loop exits naturally. Mirrors the daily-cron production pattern where each cron firing runs exactly one iteration. start_datetime: 2026-05-01 -- 3-day spinup, well clear of the 2025-07-22 to 2025-07-25 stage3 met-data duplicate zone.
After update_run_config writes the advanced forecast_start_datetime to disk + S3, the outer-scope config in run_fcre_aed_forecast still holds the previous values. Sourced helpers on the next iteration (generate_inflow_forecast.R -> convert_vera4cast_inflow) read the stale value and write inflow partitions under the previous reference_date, while run_flare re-reads config from disk and looks under the new one -- producing arrow IOError 'No such file or directory' on the reference_date partition. Refreshing config via set_up_simulation right after update_run_config keeps the in-memory view aligned with on-disk state. combined_run_aed.R doesn't trip this because each cron firing runs the loop exactly once per process; this script runs multiple iterations in one process.
Iter 1 produces a restart NetCDF whose time vector spans [start_datetime, forecast_start_datetime + horizon]. After iter 1, the combined_run_aed.R-style carry-over math sets the next iteration's start_datetime to old_forecast_start_datetime - 4 days. With a 3-day spinup that target lands BEFORE the NetCDF's earliest time, so which(datetime == new_sd) returns integer(0) and generate_initial_conditions crashes with 'replacement has length zero' when indexing pars_restart with the empty restart_index. Bumping start_datetime from 2026-05-01 to 2026-04-30 gives a 4-day spinup; iter 1's NetCDF then includes 2026-04-30, so iter 2 finds its restart_index. Production daily-cron config doesn't trip this because its 80-day spinup leaves plenty of cushion.
…leanly FLAREr::check_noaa_present catches arrow IOError on a missing NOAA partition with a tryCatch whose handler calls message(error_message), passing the simpleError condition object. message() with a single condition argument signals the condition as-is, preserving its error class -- so the FaaSr executor's outer withCallingHandlers(error=) catches it and marks the action as failed instead of letting the function return FALSE. The team's daily-cron pattern never trips this because their config runs against past dates where stage2 NOAA is always present. Our chained-iteration pattern eventually advances past the published horizon and hits the catch arm. Wrap the two call sites with our own tryCatch that converts any error into noaa_ready=FALSE so the while-loop exits cleanly. Remove this workaround when upstream check_noaa_present is fixed to use message(conditionMessage(e)).
Only set GLM_PATH='GLM3r' when unset, so a container ENV (or any explicit pre-set value) wins. Pass local_path to flare_arrow_s3_bucket at the two forecasts_parquet read sites so the read location matches where write_forecast writes in local mode.
FLAREr's run_flare on v3.1-dev-netcdf-v2 reads config$model_settings$base_AED_phyto_pars_nml_file (with _file suffix) and passes it to update_phy_states_obs_mapping, which calls read_nml on it. Point the new field at aed2.nml, which contains the aed_phytoplankton block the function expects.
FLAREr's run_model on v3.1-dev-netcdf-v2 sets init_restart_from_file via glmtools::set_nml() before each GLM run; the call requires the variable to exist in the base nml. Adds the placeholder (=0) plus init_restart_fname to match the bundled aed/default templates.
Last NML field FLAREr's run_model writes via glmtools::set_nml that wasn't present in glm3.nml. Mirrors the bundled aed/default templates.
Previous defensive default could land on GLM3r in the FaaSr container, which doesn't have GLM3r installed. Check for /opt/glm/glm first; that's where the FaaSr container drops the embedded GLM binary. Log the resolved value at function entry to make the chain transparent.
fcre-targets-rs.csv is the remote-sensing (LST) target file — usually empty because cloud-free imagery is sparse. With it as obs_filename, create_obs_matrix returned all-NA observations, obs_count was 0 at every iter, and FLAREr's no-DA branch fired — which has a separate uninitialized-pars_corr bug. Using fcre-targets-insitu.csv (the rich daily-depth observations) matches Quinn's bundled aed config and keeps every iter on the DA branch.
It was the only non-vertical entry (multi_depth=0). v3.1-dev-netcdf-v2 adds a new code path for non-vertical obs that requires model_source / model_variable / model_depth_m columns plus a non_vertical_noise_config csv — neither of which our config has. Without those, extract_modeled errors with 'missing value where TRUE/FALSE needed'. Removing secchi sidesteps the path entirely and matches what the rest of the FaaSr test workflow exercises.
Three files brought in from FLARE-forecast/FLAREr@v3.1-dev-netcdf-v2's inst/extdata/configuration/aed/ (per Quinn's reference): - non_vertical_noise_config.csv (new file, verbatim from upstream) - non_vertical_noise_config_file key added to configure_flare_glm_aed.yml pointing at the new file - observations_config_aed.csv schema extended with model_source, model_variable, model_depth_m columns. Existing multi_depth=1 rows get NA in the new columns (no behavior change). secchi row restored using upstream's diagnostic/extc_coeff/1.0 pattern but keeping FCRE's Secchi_m_sample target name.
write_restart on v3.1-dev-netcdf-v2 returns .zip when GLM restart staging produces extra files (the default for this branch) and .nc otherwise. Hard-coding .nc in the next iteration's restart_file caused get_restart_file to 404 on iter 2 because iter 1 had uploaded the file as .zip. Capturing run_flare's return value and using basename() on whatever it actually wrote keeps the chain consistent regardless of which format the underlying write produced.
v3.1-dev-netcdf-v2's write_restart only saves model state for the timesteps listed in output_settings$restart_save_timesteps (default: just forecast_start_datetime). The next iter computes start_datetime = prev_fsd - 4 days, which is not in the saved set, so generate_initial_ conditions can't find the right offset and aborts. 'all' tells write_restart to save every timestep; cheap (5 small snapshots in the zip) and removes the offset-coupling between the loop's start_datetime math and the restart save list.
…lity Quinn's new write_restart on v3.1-dev-netcdf-v2 only persists GLM output timesteps in the zip — spinup_start (which is the date the team's '-4 days' pattern lands on) is never saved. generate_restart_initial_ conditions_from_zip then can't find that date and aborts with 'No GLM restart directory found for date X. Available dates: ...'. Shift to -3 days so the next iter's start_datetime lands on the earliest GLM-output date in the restart zip while still preserving a 4-day spinup window into the new forecast_start_datetime.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a single-action entry function (
workflows/glm_aed_flare_rs/run_fcre_aed_forecast.R)that wraps the FCRE GLM-AED forecast loop in a form deployable as a FaaSr
action. The existing standalone driver (
combined_run_aed.R) is preservedunchanged and continues to work.
Changes
workflows/glm_aed_flare_rs/run_fcre_aed_forecast.R— callableentry function with embedded forecast loop, soft time-budget for
long-running-action targets, and config refresh after
update_run_config.R/convert_vera4cast_inflow.R,R/generate_forecast_score_arrow.R,and
workflows/glm_aed_flare_rs/generate_inflow_forecast.R— accept anoptional
configargument; when supplied, route S3 reads/writes throughthe new
FLAREr::flare_*dispatch layer. Whenconfig = NULL(theexisting standalone-driver call pattern), behavior is unchanged.
v3.1-dev-netcdf-v2compatibility:configure_run.yml: addsuse_faasrflagconfigure_flare_glm_aed.yml:base_AED_phyto_pars_nml_file,non_vertical_noise_config_file,restart_save_timesteps: "all";obs_filename switched to insitu
glm3.nml:init_restart_from_file,init_restart_fname,restart_fnameplaceholdersobservations_config_aed.csv: model_source / model_variable /model_depth_m columns; secchi row dropped
non_vertical_noise_config.csv: new file (verbatim from upstreamFLAREr
v3.1-dev-netcdf-v2)