Fix enhanced_cps_2024.h5 being overwritten by sparse version #569

Merged: MaxGhenis merged 1 commit into main from fix-enhanced-cps-overwrite on Mar 4, 2026

@MaxGhenis
Contributor

Summary

CRITICAL BUG FIX: policyengine.org has been showing a ~39% baseline poverty rate instead of ~14%.

create_sparse_ecps() in small_enhanced_cps.py was writing to enhanced_cps_2024.h5 instead of sparse_enhanced_cps_2024.h5. Since small_enhanced_cps.py runs AFTER enhanced_cps.py in the Modal data build pipeline, it was destroying the full enhanced CPS dataset and replacing it with a sparse version that drops input variables like employment_income.

The broken file was then uploaded to HuggingFace as the default dataset, causing:

  • All employment_income values = $0
  • Baseline SPM poverty rate inflated from ~14% to ~39%
  • All microsimulation results on policyengine.org incorrect
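
To see why losing a single input variable moves the headline number so much, here is a toy sketch. The four households, their weights, and their thresholds are invented for illustration, and the resource definition is far simpler than the real SPM calculation. It only shows how zeroing employment income pushes weighted units below their thresholds.

```python
# Toy illustration only: households, weights, and thresholds are invented,
# and "resources" here is a simplification of real SPM resources.
HOUSEHOLDS = [
    # (weight, employment_income, other_resources, poverty_threshold)
    (100, 50_000, 2_000, 30_000),
    (200, 0, 35_000, 25_000),
    (150, 20_000, 0, 30_000),
    (50, 80_000, 0, 35_000),
]


def weighted_poverty_rate(households, include_employment_income=True):
    """Weighted share of households whose resources fall below their threshold."""
    poor_weight = 0.0
    total_weight = 0.0
    for weight, employment, other, threshold in households:
        resources = other + (employment if include_employment_income else 0)
        if resources < threshold:
            poor_weight += weight
        total_weight += weight
    return poor_weight / total_weight


intact_rate = weighted_poverty_rate(HOUSEHOLDS)
zeroed_rate = weighted_poverty_rate(HOUSEHOLDS, include_employment_income=False)
print(f"intact: {intact_rate:.0%}, employment income zeroed: {zeroed_rate:.0%}")
```

Only households with enough non-employment resources stay above the line once employment income is zeroed, which is the same direction of error as the ~14% to ~39% jump reported above.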

Root cause

Introduced in commit 20572be ("Streamline data build: remove TEST_LITE/LOCAL_AREA_CALIBRATION, eliminate dense reweighting") which changed the output path from sparse_enhanced_cps_2024.h5 to enhanced_cps_2024.h5.

Fix

Revert the output filename back to sparse_enhanced_cps_2024.h5.
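
A minimal sketch of how this class of bug could be detected mechanically, assuming each pipeline step declares the file it writes. The function name and the list-of-pairs structure are hypothetical and are not part of the repository; only the script and file names come from this PR.

```python
from collections import defaultdict


def find_output_collisions(step_outputs):
    """Map each filename written by more than one step to the steps that write it.

    step_outputs: iterable of (step_name, output_filename) pairs, in run order.
    """
    writers = defaultdict(list)
    for step, filename in step_outputs:
        writers[filename].append(step)
    return {name: steps for name, steps in writers.items() if len(steps) > 1}


# File names as described in this PR; the pairing structure is hypothetical.
BEFORE_FIX = [
    ("enhanced_cps.py", "enhanced_cps_2024.h5"),
    ("small_enhanced_cps.py", "enhanced_cps_2024.h5"),
]
AFTER_FIX = [
    ("enhanced_cps.py", "enhanced_cps_2024.h5"),
    ("small_enhanced_cps.py", "sparse_enhanced_cps_2024.h5"),
]

print(find_output_collisions(BEFORE_FIX))
print(find_output_collisions(AFTER_FIX))
```

With the pre-fix names, both scripts are reported as writers of enhanced_cps_2024.h5. With the reverted name, no collision is reported.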

After merge

CI will rebuild and re-upload the correct enhanced_cps_2024.h5 to HuggingFace.

Test plan

  • CI passes (data build + upload)
  • After upload, verify enhanced_cps_2024.h5 on HF has non-zero employment_income
  • Verify baseline poverty rate returns to ~14%

🤖 Generated with Claude Code

create_sparse_ecps() was writing to enhanced_cps_2024.h5 instead of
sparse_enhanced_cps_2024.h5, destroying the full dataset after
enhanced_cps.py produced it. This caused all input variables (notably
employment_income) to be lost, inflating the baseline poverty rate
from ~14% to ~39% on policyengine.org.

Introduced in commit 20572be ("Streamline data build").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MaxGhenis merged commit 2db3bff into main on Mar 4, 2026
6 checks passed
MaxGhenis added a commit that referenced this pull request Mar 4, 2026
Checks employment_income, population counts, poverty rate, and income
tax against wide bounds after each data build. Would have caught the
enhanced CPS overwrite bug (PR #569) where employment_income_before_lsr
was dropped, zeroing out all income and inflating poverty to 40%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
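
The wide-bounds idea in this commit message can be sketched as follows. This is not the commit's actual code: the function name, the statistic names, and the poverty range are assumptions, while the employment income floor and household range echo figures quoted elsewhere in this thread. The example statistics are illustrative values, not measurements, apart from the ~39% and ~14% poverty rates reported in the PR.

```python
# Hypothetical bounds. The employment income floor and household range echo
# figures quoted in this thread; the poverty range is an assumption.
BOUNDS = {
    "employment_income_total": (5e12, float("inf")),
    "household_count": (100e6, 200e6),
    "spm_poverty_rate": (0.05, 0.25),
}


def out_of_bounds(stats, bounds=BOUNDS):
    """Return one message per statistic that is missing or outside its range."""
    problems = []
    for name, (low, high) in bounds.items():
        value = stats.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not (low <= value <= high):  # NaN also fails this comparison
            problems.append(f"{name}: {value:g} outside [{low:g}, {high:g}]")
    return problems


# Illustrative inputs: zeroed employment income and a ~39% poverty rate.
broken = {
    "employment_income_total": 0.0,
    "household_count": 130e6,
    "spm_poverty_rate": 0.39,
}
healthy = {
    "employment_income_total": 9e12,
    "household_count": 130e6,
    "spm_poverty_rate": 0.14,
}

for message in out_of_bounds(broken):
    print(message)
```

Bounds this loose would not catch subtle calibration drift, but they are enough to flag a dataset whose income has been zeroed out.
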
MaxGhenis added a commit to MaxGhenis/policyengine-us-data that referenced this pull request Mar 4, 2026
Five layers of defense against the bug class from PR PolicyEngine#569 where
a lossy sparse rebuild overwrote the enhanced CPS, dropping
employment_income_before_lsr and zeroing all income:

1. Pre-upload validation gate (upload_completed_datasets.py):
   - File size check (enhanced CPS must be >100MB, was 14MB when bad)
   - H5 structure check (critical variables must exist with data)
   - Aggregate stats check (employment income >5T, household count 100-200M)

2. Post-generation assertions (enhanced_cps.py, small_enhanced_cps.py):
   - Weight validation (no NaN, no negatives, reasonable sum)
   - Critical variable existence checks before save
   - Output file size verification after write

3. CI workflow safety (reusable_test.yaml):
   - Upload gated on success() so test failures block upload
   - Pre-upload H5 validation step checks structure before upload
   - employment_income_before_lsr explicitly checked

4. Makefile hardening:
   - Sparse file swap validates existence and size before/after
   - New validate-data target for standalone validation

5. Comprehensive test coverage:
   - Sparse dataset sanity tests (employment income, household count, poverty)
   - Direct employment income checks in enhanced and sparse test files
   - File size regression test (catches the 590MB→14MB shrink)
   - Small ECPS employment income and household count checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
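
Two of the checks listed above, the file size gate and the weight validation, can be sketched with the standard library alone. The function names are hypothetical, the byte interpretation of "100MB" is an assumption, and the weight-sum range is left to the caller because the commit message only says "reasonable sum".

```python
import math
import os

# ">100MB" comes from the commit message; reading it as MiB is an assumption.
MIN_ENHANCED_CPS_BYTES = 100 * 1024 * 1024


def file_is_large_enough(path, min_bytes=MIN_ENHANCED_CPS_BYTES):
    """True if the file exists and is strictly larger than min_bytes."""
    return os.path.isfile(path) and os.path.getsize(path) > min_bytes


def weight_problems(weights, min_sum, max_sum):
    """Return reasons the weights look invalid (an empty list if they pass)."""
    problems = []
    if any(math.isnan(w) for w in weights):
        problems.append("NaN weight")
    if any(w < 0 for w in weights):
        problems.append("negative weight")
    # Exclude NaNs from the sum so one bad value does not hide the total.
    total = sum(w for w in weights if not math.isnan(w))
    if not (min_sum <= total <= max_sum):
        problems.append(f"weight sum {total:g} outside [{min_sum:g}, {max_sum:g}]")
    return problems
```

A caller would refuse to upload when `file_is_large_enough` returns False or `weight_problems` returns a non-empty list. The 14MB file described above would fail the size gate regardless of its contents.
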