Raise clear error when load step finds empty transformed extract #223
Merged
Tolerate empty meters.json.gz / reads.json.gz files in the S3 and local output controllers (return []), then explicitly raise at the start of load_transformed if either list is empty. The exception message tells the operator that the most likely cause is a backfill DAG that has reached the vendor's data floor, and gives the exact decommission command. Previously, an empty reads file caused a cryptic JSONDecodeError deep in json.decoder, which fired the failure notifier without surfacing the actual cause. The existing pre-extract check at base.py:652 structurally cannot detect this case when the vendor's earliest reading sits above the configured backfill min_date, so MIN(flowtime) from existing readings never reaches min_date and no exception fires. This is the case Crescent (xylem_datalake) has been hitting since the backfill DAG marched past 2023-01-01: account list always non-empty, reads file empty for pre-floor chunks, JSONDecodeError on every run.
Problem
When load_transformed encounters an empty reads.json.gz (or meters.json.gz) at S3, it crashes inside the _default_decoder.decode() chain with JSONDecodeError: Expecting value: line 1 column 1 (char 0). The failure notifier fires, but the alert is cryptic: it gives no signal as to what the underlying state is.
This has been happening every 2 hours on the cadc_crescent-ami-meter-read-dag-backfill-2023-01-01-2026-04-23 DAG since the backfill marched past the vendor's earliest reading (2023-01-01 16:00 UTC). The xylem_datalake adapter queries the account table with no date filter, so the meters file is always non-empty (~7,750 accounts); the water_intervals/water_registers queries are date-filtered, so for pre-floor chunks they return zero rows and the reads file is empty.
Why the existing check doesn't catch this
_calculate_backfill_range at amiadapters/adapters/base.py:652-655 raises "consider removing this backfill" when end <= min_date (where end = MIN(flowtime) from existing readings). For Crescent, MIN(flowtime) = 2023-01-01 16:00 UTC and the configured min_date = 2023-01-01 00:00, so end > min_date forever, because the vendor's data floor sits 16 hours above the configured floor. The check is structurally incapable of firing for any org whose vendor floor is strictly above its configured min_date. The PR #222 review flagged this completion edge case at the time.
A stateless pre-extract check cannot detect "vendor has no more data"; that signal only exists after asking the vendor. The natural place for the check is post-extract.
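To make the comparison above concrete, here is a minimal sketch of the end <= min_date test using the Crescent values quoted in this description. The function name pre_extract_check_fires is hypothetical and is not the code in base.py, and treating min_date as UTC is an assumption.

```python
from datetime import datetime, timezone


def pre_extract_check_fires(end: datetime, min_date: datetime) -> bool:
    """Hypothetical stand-in for the `end <= min_date` comparison described above."""
    return end <= min_date


# Crescent values quoted in this description (min_date assumed to be UTC).
vendor_floor = datetime(2023, 1, 1, 16, 0, tzinfo=timezone.utc)   # MIN(flowtime)
configured_min = datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)  # min_date

# MIN(flowtime) can never drop below the vendor floor, so this stays False.
crescent_fires = pre_extract_check_fires(vendor_floor, configured_min)
gap = vendor_floor - configured_min

print(crescent_fires, gap)
```

Because the earliest reading the vendor can ever return is the floor itself, no amount of further backfilling changes the result of this comparison.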
Changes
amiadapters/outputs/s3.py and amiadapters/outputs/local.py: read_transformed_meters and read_transformed_meter_reads return [] instead of crashing when the file is empty (purely defensive: empty input should never produce JSONDecodeError).
amiadapters/adapters/base.py: in load_transformed, if either meters or reads is empty, raise a clear exception with the decommission CLI suggestion. This fires the existing failure notifier with an alert message that tells the operator what to do.
The exception message handles both completion and outage cases:
python cli.py config remove-backfill ..."
Per-adapter safety check
I spot-checked each adapter to make sure no live path produces empty meters+reads under healthy steady-state operation:
load_from_file on meters_and_reads.json already errors in transform without allow_empty=True
latestReading
allow_empty=True for sub-files. Realistically a 2-day Oracle window with 0 reads on an active utility is highly unlikely
If the metersense / xylem_moulton_niguel adapters end up firing false-positive alerts in practice, the check can be made adapter-aware in a follow-up. The strict check is the safe default: water-meter telemetry is continuous, and zero rows in any reasonable window is worth alerting on.
Test plan
test/amiadapters/outputs/test_s3.py: 2 new tests for empty-file return-[] behavior
test/amiadapters/outputs/test_local.py: 2 new tests for the same in LocalTaskOutputController
test/amiadapters/test_base.py: 4 new tests covering load_transformed happy path + empty meters / empty reads / both empty
JSONDecodeError, then decommission the backfill DAG
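For reviewers who want the shape of the behavior under test without opening the diff, the following is a rough, self-contained sketch and not the code in this PR. The names read_json_gz, EmptyTransformedExtractError, and check_transformed_not_empty are hypothetical, the message wording is illustrative only, and the sketch assumes each file holds a single JSON document.

```python
import gzip
import json
import tempfile
from pathlib import Path


class EmptyTransformedExtractError(Exception):
    """Hypothetical exception type; the PR only promises 'a clear exception'."""


def read_json_gz(path: Path) -> list:
    """Hypothetical reader: return [] for an empty payload instead of raising."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    if not text.strip():
        return []  # json.loads("") would raise JSONDecodeError here
    return json.loads(text)  # assumes the file is one JSON document


def check_transformed_not_empty(meters: list, reads: list) -> None:
    """Raise early, naming whichever list is empty (illustrative wording)."""
    missing = [name for name, rows in (("meters", meters), ("reads", reads)) if not rows]
    if missing:
        raise EmptyTransformedExtractError(
            f"Transformed extract has no rows for: {', '.join(missing)}. "
            "A backfill may have reached the vendor's data floor."
        )


# Reproduce the Crescent shape: non-empty meters, empty reads.
with tempfile.TemporaryDirectory() as tmp:
    reads_path = Path(tmp) / "reads.json.gz"
    with gzip.open(reads_path, "wt", encoding="utf-8") as f:
        f.write("")
    meters_path = Path(tmp) / "meters.json.gz"
    with gzip.open(meters_path, "wt", encoding="utf-8") as f:
        json.dump([{"meter_id": "m1"}], f)
    reads = read_json_gz(reads_path)
    meters = read_json_gz(meters_path)

try:
    check_transformed_not_empty(meters, reads)
    guard_message = None
except EmptyTransformedExtractError as exc:
    guard_message = str(exc)

print(guard_message)
```

Keeping the reader tolerant and raising in one place means the failure notifier carries a message about the extract's state rather than a decoder traceback.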