Fix enhanced_cps_2024.h5 being overwritten by sparse version #569
Merged
create_sparse_ecps() was writing to enhanced_cps_2024.h5 instead of sparse_enhanced_cps_2024.h5, destroying the full dataset after enhanced_cps.py produced it. This caused all input variables (notably employment_income) to be lost, inflating the baseline poverty rate from ~14% to ~39% on policyengine.org. Introduced in commit 20572be ("Streamline data build"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
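The overwrite described above happens whenever two build steps resolve to the same output filename. A minimal sketch of a guard against that class of bug follows. The function name `find_output_collisions` and the step-to-filename mapping are hypothetical illustrations and are not taken from the repository.

```python
from collections import defaultdict
from pathlib import Path


def find_output_collisions(outputs: dict[str, str]) -> dict[str, list[str]]:
    """Return every output filename that more than one build step writes to.

    `outputs` maps a build step name to the file it writes. Only the final
    path component is compared, which is an assumption that all steps write
    into the same storage directory.
    """
    writers: dict[str, list[str]] = defaultdict(list)
    for step, filename in outputs.items():
        writers[Path(filename).name].append(step)
    return {name: steps for name, steps in writers.items() if len(steps) > 1}


# The buggy configuration: the sparse step targets the full dataset's file.
buggy = {
    "enhanced_cps.py": "enhanced_cps_2024.h5",
    "create_sparse_ecps": "enhanced_cps_2024.h5",
}

# The fixed configuration: each step has its own output file.
fixed = {
    "enhanced_cps.py": "enhanced_cps_2024.h5",
    "create_sparse_ecps": "sparse_enhanced_cps_2024.h5",
}

if __name__ == "__main__":
    print(find_output_collisions(buggy))
    print(find_output_collisions(fixed))
```

A check like this could run before the pipeline starts, so a path regression fails fast instead of silently replacing a dataset.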
MaxGhenis added a commit that referenced this pull request on Mar 4, 2026
Checks employment_income, population counts, poverty rate, and income tax against wide bounds after each data build. Would have caught the enhanced CPS overwrite bug (PR #569) where employment_income_before_lsr was dropped, zeroing out all income and inflating poverty to 40%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
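A wide-bounds check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the commit's actual code: the statistic names, the `Bound` type, and the `check_bounds` helper are hypothetical. The employment income floor (5T) and household count range (100-200M) come from the related commit messages in this thread, while the employment income ceiling and the poverty rate range are assumed values chosen only to bracket the ~14% figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    low: float
    high: float


# Hypothetical statistic names. The ceilings and the poverty range are assumptions.
SANITY_BOUNDS = {
    "employment_income_total": Bound(5e12, 20e12),  # floor from the thread
    "household_count": Bound(100e6, 200e6),         # range from the thread
    "poverty_rate": Bound(0.05, 0.25),              # assumed range around ~14%
}


def check_bounds(stats: dict[str, float], bounds=SANITY_BOUNDS) -> list[str]:
    """Return one message per statistic that is missing or out of bounds."""
    failures = []
    for name, bound in bounds.items():
        if name not in stats:
            failures.append(f"{name}: missing")
            continue
        value = stats[name]
        # A NaN value fails this comparison and is therefore reported too.
        if not (bound.low <= value <= bound.high):
            failures.append(
                f"{name}: {value!r} outside [{bound.low!r}, {bound.high!r}]"
            )
    return failures


if __name__ == "__main__":
    # Aggregates resembling the broken build: zero income, ~39% poverty.
    broken = {
        "employment_income_total": 0.0,
        "household_count": 130e6,
        "poverty_rate": 0.39,
    }
    for message in check_bounds(broken):
        print(message)
```

Keeping the bounds wide means the check tolerates ordinary year-to-year drift while still failing on a build where income collapses to zero.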
MaxGhenis added a commit to MaxGhenis/policyengine-us-data that referenced this pull request on Mar 4, 2026
Five layers of defense against the bug class from PR PolicyEngine#569 where a lossy sparse rebuild overwrote the enhanced CPS, dropping employment_income_before_lsr and zeroing all income:

1. Pre-upload validation gate (upload_completed_datasets.py):
   - File size check (enhanced CPS must be >100MB, was 14MB when bad)
   - H5 structure check (critical variables must exist with data)
   - Aggregate stats check (employment income >5T, household count 100-200M)
2. Post-generation assertions (enhanced_cps.py, small_enhanced_cps.py):
   - Weight validation (no NaN, no negatives, reasonable sum)
   - Critical variable existence checks before save
   - Output file size verification after write
3. CI workflow safety (reusable_test.yaml):
   - Upload gated on success() so test failures block upload
   - Pre-upload H5 validation step checks structure before upload
   - employment_income_before_lsr explicitly checked
4. Makefile hardening:
   - Sparse file swap validates existence and size before/after
   - New validate-data target for standalone validation
5. Comprehensive test coverage:
   - Sparse dataset sanity tests (employment income, household count, poverty)
   - Direct employment income checks in enhanced and sparse test files
   - File size regression test (catches the 590MB→14MB shrink)
   - Small ECPS employment income and household count checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
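The first layer, the pre-upload gate, could look roughly like the sketch below. It is not the contents of upload_completed_datasets.py: all function names are hypothetical, and the dataset is represented as a plain mapping from variable name to values so the example runs without an HDF5 library. In practice those values would be read from the H5 file, whose internal layout is not described in this thread.

```python
import os
from typing import Mapping, Sequence

# From the commit message: the enhanced CPS must exceed 100MB (it was 14MB when bad).
MIN_ENHANCED_CPS_BYTES = 100 * 1024 * 1024

# Assumed list; the commit message names only employment_income_before_lsr.
CRITICAL_VARIABLES = ("employment_income_before_lsr",)


def check_file_size(path: str, min_bytes: int = MIN_ENHANCED_CPS_BYTES) -> list[str]:
    """Return problems with the file's existence or size (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path}: file does not exist"]
    size = os.path.getsize(path)
    if size < min_bytes:
        return [f"{path}: {size} bytes is below the {min_bytes}-byte minimum"]
    return []


def check_critical_variables(
    data: Mapping[str, Sequence[float]],
    required: Sequence[str] = CRITICAL_VARIABLES,
) -> list[str]:
    """Return problems for required variables that are missing or all zero."""
    problems = []
    for name in required:
        if name not in data:
            problems.append(f"{name}: missing")
        elif not any(value != 0 for value in data[name]):
            problems.append(f"{name}: empty or all zero")
    return problems


def validate_before_upload(
    path: str,
    data: Mapping[str, Sequence[float]],
    min_bytes: int = MIN_ENHANCED_CPS_BYTES,
) -> None:
    """Raise ValueError listing every problem, so the upload step never runs."""
    problems = check_file_size(path, min_bytes) + check_critical_variables(data)
    if problems:
        raise ValueError("Refusing to upload:\n" + "\n".join(problems))
```

Collecting all problems before raising gives a single failure report per build, which is easier to read in CI logs than stopping at the first failed check.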
Summary
CRITICAL BUG FIX: policyengine.org has been showing ~39% baseline poverty rate instead of ~14%.

create_sparse_ecps() in small_enhanced_cps.py was writing to enhanced_cps_2024.h5 instead of sparse_enhanced_cps_2024.h5. Since small_enhanced_cps.py runs AFTER enhanced_cps.py in the Modal data build pipeline, it was destroying the full enhanced CPS dataset and replacing it with a sparse version that drops input variables like employment_income.

The broken file was then uploaded to HuggingFace as the default dataset, causing:
- employment_income values = $0

Root cause
Introduced in commit 20572be ("Streamline data build: remove TEST_LITE/LOCAL_AREA_CALIBRATION, eliminate dense reweighting") which changed the output path from sparse_enhanced_cps_2024.h5 to enhanced_cps_2024.h5.

Fix
Revert the output filename back to sparse_enhanced_cps_2024.h5.

After merge
CI will rebuild and re-upload the correct enhanced_cps_2024.h5 to HuggingFace.

Test plan
- enhanced_cps_2024.h5 on HF has non-zero employment_income

🤖 Generated with Claude Code