
Harden data pipeline against corrupted dataset uploads#570

Open
MaxGhenis wants to merge 5 commits into main from add-dataset-sanity-tests

Conversation


@MaxGhenis MaxGhenis commented Mar 4, 2026

Summary

Hardens the data pipeline with 5 layers of defense to prevent corrupted datasets from being uploaded to HuggingFace — the bug class from PR #569 where employment_income_before_lsr was dropped, zeroing all income and inflating poverty to 40%.

Defense layers

  1. Pre-upload validation gate (upload_completed_datasets.py): File size, H5 structure, and aggregate stats checks run before ANY upload
  2. Post-generation assertions (enhanced_cps.py, small_enhanced_cps.py): Weight validation, critical variable checks, output file size verification
  3. CI workflow safety (reusable_test.yaml): Upload gated on success(), pre-upload H5 validation step
  4. Makefile hardening: Sparse file swap validates existence/size before and after
  5. Comprehensive tests: Sparse dataset sanity tests, direct employment income checks, file size regression test
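As a rough sketch of what layer 1 amounts to, the gate below collects errors and refuses to proceed if any are found. The function and constant names are illustrative, not the PR's actual code, and a plain dict stands in for the H5 file's datasets (the real gate would read them with h5py):

```python
import os

# Thresholds taken from the PR description; the names are hypothetical.
MIN_SIZE_BYTES = 100 * 1024 * 1024  # enhanced CPS should be well over 100MB
CRITICAL_VARIABLES = ["employment_income_before_lsr"]

def validate_before_upload(path, datasets):
    """Collect validation errors before ANY upload; raise if any are found.

    `datasets` maps variable name -> list of values, standing in for the
    H5 file's contents.
    """
    errors = []
    size = os.path.getsize(path)
    if size < MIN_SIZE_BYTES:
        errors.append(f"file is {size} bytes, below the {MIN_SIZE_BYTES} floor")
    for var in CRITICAL_VARIABLES:
        values = datasets.get(var)
        if not values:
            errors.append(f"critical variable {var} is missing or empty")
        elif sum(values) <= 0:
            errors.append(f"{var} sums to {sum(values)}, expected positive")
    if errors:
        raise ValueError("; ".join(errors))
```

PR #569's bad file (14MB, with `employment_income_before_lsr` empty) would trip both the size check and the structure check.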

How each layer would have caught PR #569's bug

| Layer | Would catch? | How |
| --- | --- | --- |
| Pre-upload validation | Yes | File was 14MB (threshold: 100MB), and `employment_income_before_lsr` was empty |
| Post-generation assertions | Yes | `create_sparse_ecps()` output was missing critical variables |
| CI workflow | Yes | H5 check catches missing `employment_income_before_lsr` |
| Makefile swap | Yes | Sparse file didn't exist (wrong filename), so the swap would fail |
| Tests | Yes | `test_ecps_employment_income_positive` catches $0 employment income |
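The aggregate bounds used by the validation gate ($5T employment income, 100-200M households, per the commit message below) translate into a check like the following. This is an illustrative sketch, not the PR's actual test code:

```python
def check_dataset_aggregates(total_employment_income, household_count):
    """Return a list of problems found by wide-bound aggregate checks.

    Bounds follow the PR description: total employment income above $5T
    and a household count between 100M and 200M.
    """
    problems = []
    if total_employment_income <= 5e12:
        problems.append(
            f"employment income {total_employment_income:,.0f} at or below $5T"
        )
    if not 100e6 <= household_count <= 200e6:
        problems.append(f"household count {household_count:,.0f} outside 100M-200M")
    return problems
```

Under PR #569's bug, total employment income was $0, so the first check fires immediately; the bounds are wide enough to tolerate normal year-to-year drift.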

Test plan

  • CI lint/format passes
  • Smoke tests pass
  • Full data build + sanity tests pass (will run on merge to main)
  • Verify pre-upload validation blocks upload when given a bad file

🤖 Generated with Claude Code

@MaxGhenis MaxGhenis changed the title Add dataset sanity tests for core variables Harden data pipeline against corrupted dataset uploads Mar 4, 2026
@MaxGhenis MaxGhenis force-pushed the add-dataset-sanity-tests branch from 4995d83 to 933427d on March 4, 2026 at 23:17
MaxGhenis and others added 4 commits March 4, 2026 18:38
Checks employment_income, population counts, poverty rate, and income
tax against wide bounds after each data build. Would have caught the
enhanced CPS overwrite bug (PR #569) where employment_income_before_lsr
was dropped, zeroing out all income and inflating poverty to 40%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five layers of defense against the bug class from PR #569 where
a lossy sparse rebuild overwrote the enhanced CPS, dropping
employment_income_before_lsr and zeroing all income:

1. Pre-upload validation gate (upload_completed_datasets.py):
   - File size check (enhanced CPS must be >100MB, was 14MB when bad)
   - H5 structure check (critical variables must exist with data)
   - Aggregate stats check (employment income >5T, household count 100-200M)

2. Post-generation assertions (enhanced_cps.py, small_enhanced_cps.py):
   - Weight validation (no NaN, no negatives, reasonable sum)
   - Critical variable existence checks before save
   - Output file size verification after write

3. CI workflow safety (reusable_test.yaml):
   - Upload gated on success() so test failures block upload
   - Pre-upload H5 validation step checks structure before upload
   - employment_income_before_lsr explicitly checked

4. Makefile hardening:
   - Sparse file swap validates existence and size before/after
   - New validate-data target for standalone validation

5. Comprehensive test coverage:
   - Sparse dataset sanity tests (employment income, household count, poverty)
   - Direct employment income checks in enhanced and sparse test files
   - File size regression test (catches the 590MB→14MB shrink)
   - Small ECPS employment income and household count checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MaxGhenis MaxGhenis force-pushed the add-dataset-sanity-tests branch from 7ef864d to 6ac6adb on March 4, 2026 at 23:39
The variable was named optimised_weights_dense but the actual variable
from reweight() is optimised_weights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
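The weight checks described under layer 2 might look roughly like this. The commit above confirms the variable from reweight() is `optimised_weights`, but the validation function and its sum range here are hypothetical:

```python
import math

def validate_weights(weights, expected_sum_range=(1e8, 4e8)):
    """Layer-2 style post-generation weight checks: no NaN, no negatives,
    and a sum inside a plausible range (the range values are illustrative).
    """
    if any(math.isnan(w) for w in weights):
        raise ValueError("NaN weight found")
    if any(w < 0 for w in weights):
        raise ValueError("negative weight found")
    total = sum(weights)
    lo, hi = expected_sum_range
    if not lo <= total <= hi:
        raise ValueError(f"weight sum {total:,.0f} outside [{lo:,.0f}, {hi:,.0f}]")
```

Running such a check right before the H5 write means a degenerate reweighting result fails the build instead of reaching upload.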
