Add calibration package checkpointing, target config, and hyperparameter CLI by baogorek · Pull Request #538 · PolicyEngine/policyengine-us-data

baogorek · 2026-02-17T17:19:48Z

Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562

Summary

Calibration package checkpointing: --build-only saves the expensive matrix build as a pickle, --package-path loads it for fast re-fitting with different hyperparameters or target sets
Target config YAML: Declarative exclusion rules (target_config.yaml) replace hardcoded target filtering; checked-in config reproduces the junkyard's 22 excluded groups
Hyperparameter CLI flags: --beta, --lambda-l2, --learning-rate are now tunable from the command line and Modal runner
Modal runner improvements: Streaming subprocess output, support for new flags
Documentation: docs/calibration.md covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
At-large district naming fix: H5 filenames for at-large districts now use XX-01 (conventional 1-based) instead of XX-00
GCS staging fix: GCS uploads moved from staging phase to promotion phase, so both GCS and HuggingFace are updated together during promote
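The build-then-fit split described in the first summary item could look roughly like the sketch below. It is a minimal illustration, not the PR's code: `save_package`, `load_package`, and the dictionary keys are hypothetical names, and the real package presumably holds a sparse matrix plus richer metadata.

```python
import pickle
import tempfile
from pathlib import Path


def save_package(path, matrix, targets, metadata):
    """Persist the expensive build step once (hypothetical layout)."""
    package = {"X": matrix, "targets": targets, "metadata": metadata}
    with open(path, "wb") as f:
        pickle.dump(package, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_package(path):
    """Reload the saved build so fitting can start immediately."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "calibration_package.pkl"
    # Toy stand-ins for the real matrix and target vector.
    X = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
    save_package(path, X, [10.0, 20.0], {"n_clones": 3})
    package = load_package(path)
```

Because the package is self-contained, the same file can be re-fit with different hyperparameters or a different target config without repeating the matrix build.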

Note: This branch includes commits from #537 (PUF impute) since the calibration pipeline depends on that work. The calibration-specific changes are in the top commit.

Test plan

pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — CLI arg parsing tests
pytest policyengine_us_data/tests/test_calibration/test_target_config.py — target config filtering + package round-trip tests
Manual: make calibrate-build produces package, --package-path loads it and fits
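For readers who want a feel for what the CLI arg parsing tests exercise, here is a minimal `argparse` sketch. The flag names come from this PR; the `None` defaults and the `build_parser` helper are placeholders, since the real defaults are not stated in this conversation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this PR."""
    parser = argparse.ArgumentParser(description="Calibration CLI sketch")
    parser.add_argument("--build-only", action="store_true")
    parser.add_argument("--package-path", default=None)
    parser.add_argument("--package-output", default=None)
    parser.add_argument("--target-config", default=None)
    # Defaults of None are placeholders, not the project's actual defaults.
    parser.add_argument("--beta", type=float, default=None)
    parser.add_argument("--lambda-l0", type=float, default=None)
    parser.add_argument("--lambda-l2", type=float, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    return parser


args = build_parser().parse_args(
    [
        "--package-path", "/tmp/calibration_package.pkl",
        "--beta", "0.65",
        "--lambda-l2", "1e-8",
    ]
)
```

Note that `argparse` converts dashes to underscores, so `--lambda-l2` is read back as `args.lambda_l2`.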

🤖 Generated with Claude Code

juaristi22

Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!

Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?

juaristi22 · 2026-02-25T10:09:11Z

A couple questions on recent changes

Post-cloning PUF imputation removed: is this permanent?

Commit 49a1f66 ("Remove redundant --puf-dataset flag, add national targets") removed the ability to run PUF cloning inside the calibration pipeline, with the rationale that PUF cloning already happens upstream in extended_cps.py.

However, PR #516 specifically designed the pipeline so that PUF + QRF imputation runs after cloning and geography assignment, so that each clone gets geographically-informed imputations (with state_fips as a QRF predictor). As Max described it:

Each geographic clone gets geographically-appropriate PUF tax imputations instead of identical national-average ones duplicated everywhere. State is now a QRF predictor — California clones get California-like tax distributions.

With the current flow (extended_cps.py runs PUF once on base records → calibration pipeline clones 10x), all clones of the same household share identical PUF-imputed values, losing that variability benefit.

Are we planning to bring back the post-cloning PUF imputation once the calibration pipeline is stabilized? Or has the approach changed?

All target variable precomputation moved to county level: is this worth the large increase in computation?

In commit 02f8ad0, _build_county_values was introduced to handle county-dependent variables (specifically aca_ptc, since marketplace premiums vary by county). It ran alongside the existing _build_state_values which handled everything else via 51 state-level simulations — a two-tier design gated by COUNTY_DEPENDENT_VARS = {"aca_ptc"}.

Then in commit 40fb389 (a "checkpoint"), COUNTY_DEPENDENT_VARS was removed and all target variable precomputation was moved to _build_county_values. _build_state_values was demoted to only computing constraint variables.

This means the matrix builder now runs ~1,000-2,000 county-level simulations (one per unique county in the geography assignment) instead of 51 state-level simulations for variables like snap, household_count, etc. that don't depend on county. That is roughly a 20-40x increase in compute cost with no accuracy benefit for those variables.

Was this an intentional simplification, or a debugging shortcut that could be reverted to the two-tier approach? Restoring COUNTY_DEPENDENT_VARS and routing only aca_ptc (and any future county-dependent vars) through county precomputation would significantly reduce matrix build time.
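The two-tier routing proposed above can be sketched as follows. This is only an illustration under stated assumptions: `plan_precomputation` is a hypothetical helper, `n_counties=1500` is an arbitrary midpoint of the ~1,000-2,000 range, and "cost" is a crude count of variable evaluations rather than measured runtime.

```python
COUNTY_DEPENDENT_VARS = {"aca_ptc"}


def plan_precomputation(target_vars, n_states=51, n_counties=1500):
    """Route each variable to the state or county tier and compare costs.

    Cost is counted as (number of simulations) x (variables computed),
    which is a simplification for illustration only.
    """
    county_vars = sorted(v for v in target_vars if v in COUNTY_DEPENDENT_VARS)
    state_vars = sorted(v for v in target_vars if v not in COUNTY_DEPENDENT_VARS)
    two_tier_cost = n_states * len(state_vars) + n_counties * len(county_vars)
    all_county_cost = n_counties * len(target_vars)
    return {
        "state": state_vars,
        "county": county_vars,
        "two_tier_cost": two_tier_cost,
        "all_county_cost": all_county_cost,
    }


plan = plan_precomputation({"snap", "household_count", "aca_ptc"})
```

With these toy numbers, only `aca_ptc` pays the county-level price, while the other variables are computed once per state.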

baogorek · 2026-02-26T02:03:04Z

@juaristi22 thank you for your thoughtful and excellent comments

By the time this PR is done, I do want the calibration_log.csv to be saved once again! Actually, I do get it in this workflow that I run after make data:

  # 1. Build matrix locally (no GPU needed)
  source ~/envs/sep/bin/activate
  python -m policyengine_us_data.calibration.unified_calibration \
      --build-only \
      --skip-source-impute \
      --target-config policyengine_us_data/calibration/target_config.yaml \
      --package-output /tmp/calibration_package.pkl

  # 2. Push package to Modal volume
  modal volume put calibration-data /tmp/calibration_package.pkl calibration_package.pkl --force

  # 3. Fit on GPU from package
  modal run modal_app/remote_calibration_runner.py \
      --branch calibration-pipeline-improvements \
      --gpu T4 \
      --package-volume \
      --epochs 1000 \
      --beta 0.65 \
      --lambda-l0 1e-7 \
      --lambda-l2 1e-8 \
      --log-freq 500 \
      --target-config policyengine_us_data/calibration/target_config.yaml

That will fit the model on Modal and drop a calibration_log.csv right on your local drive. I know one of the issues was about actually storing it in an archive, and maybe that should be out of scope given this PR's complexity.

You're absolutely right about the counties, and I've now made the per-county computation optional. I know that aca_ptc's formula does involve the county, so we may have to live with not getting perfect matches from X * w to sim.calculate("aca_ptc").sum(). But I saw the perfect ratios with your code, so I know we can do it.
Yes, you're right that I lost a bit of the vision by not imputing new PUF values for every clone. That was a concession to the brutal difficulty of getting X * w to match sim.calculate().sum() and to the speed (which is still slow even after taking out the counties). From Claude:
- Would it make X*w consistency harder? Yes, significantly. Right now the matrix builder can precompute values once per state and reuse them across all 436 clones — because every clone of household #5000 has the same underlying values regardless of which state it's assigned to. The only things that change per clone are geography inputs and takeup draws.
- Issue created: Restore post-cloning PUF QRF re-imputation for geographic tax variation #560
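The consistency check discussed above, where X * w should reproduce the weighted simulation total, can be illustrated with toy numbers. This sketch assumes X has one row per target and one column per household record, and it uses a plain weighted sum as a stand-in for `sim.calculate(...).sum()`; none of the values come from the actual pipeline.

```python
import numpy as np

# Rows are targets, columns are household records (toy values).
X = np.array(
    [
        [1.0, 0.0, 2.0],      # e.g. a person-count style target
        [300.0, 0.0, 150.0],  # e.g. a dollar-amount style target
    ]
)
w = np.array([10.0, 5.0, 20.0])  # calibrated household weights

# Estimate implied by the calibration matrix.
matrix_estimate = X @ w

# Stand-in for the weighted total a simulation would report.
direct_estimate = np.array([(row * w).sum() for row in X])

max_abs_gap = np.abs(matrix_estimate - direct_estimate).max()
```

If per-record values differ between the matrix build and the final simulation (for example because a formula depends on county), the two estimates drift apart even with identical weights.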

…ter CLI
- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv, epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Instead of creating a new Microsimulation per clone (~3 min each, 22 hours for 436 clones), precompute values for all 51 states on one sim object (~3 min total), then assemble per-clone values via numpy fancy indexing (~microseconds per clone). New methods: _build_state_values, _assemble_clone_values, _evaluate_constraints_from_values, _calculate_target_values_from_values. DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds. Takeup re-randomization deferred to future post-processing layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
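The "precompute per state, then assemble per clone" idea can be shown with NumPy fancy indexing. This is a toy sketch, not the body of `_assemble_clone_values`: the array layout (`state_values[s, h]`) and the variable names are assumptions made for illustration.

```python
import numpy as np

n_states, n_households = 51, 4
rng = np.random.default_rng(0)

# state_values[s, h]: a variable's value for household h when simulated
# as if it lived in state s (precomputed once, in this toy layout).
state_values = rng.random((n_states, n_households))

# One clone assigns each household to a state index in 0..50.
clone_state_idx = np.array([4, 4, 36, 50])

# Fancy indexing picks, for every household, the row of its assigned state.
clone_values = state_values[clone_state_idx, np.arange(n_households)]
```

Each additional clone then costs one indexing operation instead of a fresh simulation, which is where the speedup described in the commit message comes from.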

- Modal runner: add --package-volume flag to read calibration package from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- target_config.yaml: exclude everything except person_count/age (~8,766 targets) to isolate fitting issues from zero-target and zero-row-sum problems in policy variables
- target_config_full.yaml: backup of the previous full config
- unified_calibration.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments to fix CUDA memory fragmentation during backward pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- apply_target_config: support 'include' rules (keep only matching targets) in addition to 'exclude' rules; geo_level now optional
- target_config.yaml: 3-line include config replaces 90-line exclusion list for age demographics (person_count with age domain, ~8,784 targets)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
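A minimal sketch of include/exclude filtering in the spirit of this commit is shown below. It is not the project's `apply_target_config`: `filter_targets`, the rule keys, and the dict-based config are hypothetical, and the real YAML schema may differ.

```python
def matches(target, rule):
    """A rule matches when every key it specifies equals the target's value.

    Keys the rule omits (e.g. geo_level) are treated as wildcards.
    """
    return all(target.get(key) == value for key, value in rule.items())


def filter_targets(targets, config):
    """Keep targets matching any include rule, then drop excluded ones."""
    include = config.get("include", [])
    exclude = config.get("exclude", [])
    kept = []
    for target in targets:
        if include and not any(matches(target, rule) for rule in include):
            continue
        if any(matches(target, rule) for rule in exclude):
            continue
        kept.append(target)
    return kept


targets = [
    {"variable": "person_count", "domain_variable": "age", "geo_level": "district"},
    {"variable": "person_count", "domain_variable": "age", "geo_level": "state"},
    {"variable": "snap", "domain_variable": None, "geo_level": "state"},
]
# Include rule without geo_level: matches age counts at every level.
config = {"include": [{"variable": "person_count", "domain_variable": "age"}]}
kept = filter_targets(targets, config)
```

A short include list like this is what lets a few lines of config stand in for a long exclusion list.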

The roth_ira_contributions target has zero row sum (no CPS records), making it impossible to calibrate. Remove it from target_config.yaml so Modal runs don't waste epochs on an unachievable target. Also adds `python -m policyengine_us_data.calibration.validate_package` CLI tool for pre-upload package validation, with automatic validation on --build-only runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Rename w_district_calibration.npy and unified_weights.npy to calibration_weights.npy everywhere (HF paths, local defaults, docs)
- Add upload_calibration_artifacts() to huggingface.py for atomic multi-file HF uploads (weights + blocks + logs in one commit)
- Add --upload flag (replaces --upload-logs) and --trigger-publish flag to remote_calibration_runner.py
- Add _trigger_repository_dispatch() for GitHub workflow auto-trigger
- Remove dead _upload_logs_to_hf() and _upload_calibration_artifact()
- Add scripts/upload_calibration.py CLI + make upload-calibration target
- Update modal_app/README.md with new flags and artifact table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Chains make data, upload-dataset (API direct to HF), calibrate-modal (GPU fit + upload weights), and stage-h5s (build + stage H5s). Configurable via GPU, EPOCHS, BRANCH, NUM_WORKERS variables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Delete 3 one-off scripts (diagnose_ky01, generate_test_data, migrate_versioned)
- Move check_staging_sums to calibration module with CLI args
- Move verify_county_fix to test_xw_consistency.py (pytest, @slow)
- Inline upload_calibration.py into Makefile target
- Add sanity_checks.py: structural integrity checks for H5 files
- Add --sanity-only flag to validate_staging.py
- Add Makefile targets: validate-staging, check-staging, check-sanity, upload-validation
- Add validation_results.csv to upload_calibration_artifacts() log_files
- Append 4 doc sections: takeup rerandomization, block seeding, X@w invariant, gating workflow
- Add calibration.md to MyST TOC

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- unified_calibration: emit SOURCE_IMPUTED_PATH for runner to capture
- remote_calibration_runner: upload source-imputed dataset to HF after build
- local_area: prefer source-imputed dataset when building staged H5s
- publish_local_area: same source-imputed preference
- Improved logging in remote runner (banner format, push plan)
- Added check_volume_package helper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add validate-staging job to local_area_publish.yaml that runs after staging, uploads results to HF, and posts summary to step summary
- Add `make promote` target with auto-detected version from pyproject.toml
- Fix validate_staging.py OOM: replace sim_cache dict with one-at-a-time loading, explicit del+gc.collect between states to prevent two sims coexisting in memory (failed on CO after CA)
- Add per-state population logging and total weighted population check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
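The one-at-a-time loading pattern behind the OOM fix can be sketched as below. `validate_states` and `FakeSim` are hypothetical stand-ins, not the code in validate_staging.py; `FakeSim` only counts live instances so the effect of the explicit `del` is visible.

```python
import gc


class FakeSim:
    """Toy stand-in for a large simulation object; tracks live instances."""

    live = 0
    peak = 0

    def __init__(self, state):
        self.state = state
        FakeSim.live += 1
        FakeSim.peak = max(FakeSim.peak, FakeSim.live)

    def __del__(self):
        FakeSim.live -= 1


def validate_states(states, load_sim, check):
    """Load, check, and release one simulation at a time (no cache dict)."""
    results = {}
    for state in states:
        sim = load_sim(state)
        results[state] = check(sim)
        # Drop the reference before the next load so two sims never coexist.
        del sim
        gc.collect()
    return results


results = validate_states(["CA", "CO"], FakeSim, lambda sim: sim.state.lower())
```

Without the `del`, rebinding `sim` on the next iteration would construct the new object before the old one is released, so two simulations would briefly be in memory at once.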

…dal runner
- Add git provenance (branch, commit, dirty flag, version, dataset/DB SHA checksums) to calibration package metadata and run config output
- Print provenance banner on package load with staleness/branch warnings
- Write JSON sidecar on Modal volume for lightweight provenance checks
- Remote runner: remove package_bytes param, auto-upload to Modal volume via --package-path, show provenance on --prebuilt-matrices
- Fix takeup rerandomization: move override after initial state/county setup to avoid poisoning base calculations; county-level saves/restores original takeup values between counties and clears cache after override
- Add domain_variable: age to district person_count in target config
- Show git provenance fields in validation report
- Replace hardcoded RECORD_IDX in matrix masking tests with dynamic record selection to avoid brittleness when data/formulas change
- Update docs for new --package-path upload behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add parallel national calibration pipeline that produces a sparse national US.h5 alongside the existing local-area H5 files. Both calibrations share the pre-built matrix and run in parallel.
- Add prefix parameter to HF upload/download for national artifacts
- Add --national flag to calibration runner (defaults lambda_l0=1e-4)
- Add build_national_h5() and national worker support
- Add coordinate_national_publish() and main_national() entrypoint
- Add Makefile targets: calibrate-modal-national, calibrate-both, stage-national-h5, stage-all-h5s
- Remove --prebuilt-matrices flag; volume-fit is now the default
- Update pipeline target to run both calibrations in parallel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Comment out all targets except district-level age demographics
- Rewrite build_national_h5 to collapse CD weights to household level instead of running 436 per-CD simulations
- Add validate_national_h5.py script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Uncomment all ~80 targets in target_config.yaml (district, state, national)
- Wire geo_labels.json through remote calibration runner (parse, save, upload)
- Add staging support for national H5 (upload_to_staging_hf instead of direct upload_local_area_file)
- Add main_national_promote entrypoint for two-phase publish
- Include prior uncommitted work: geo_labels rename, stacked_dataset_builder, publish_local_area refactor, takeup utils, huggingface upload improvements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace inline takeup draw loops in unified_matrix_builder.py (both the parallel worker path and the sequential clone path) with calls to the shared compute_block_takeup_for_entities() from utils/takeup.py. Remove deprecated functions from takeup.py that are no longer used: draw_takeup_for_geo, compute_entity_takeup_for_geo, apply_takeup_draws_to_sim, apply_block_takeup_draws_to_sim, and _build_entity_to_hh_index. Also remove the now-unused rerandomize_takeup function from unified_calibration.py. Simplify compute_block_takeup_for_entities signature by deriving state FIPS from block GEOID prefix instead of requiring a separate entity_state_fips parameter. Update tests to exercise the remaining shared functions directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
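Deriving the state FIPS code from a block GEOID relies on the standard Census layout, where a 15-digit block GEOID is state (2) + county (3) + tract (6) + block (4). The helper below is a hypothetical illustration of that idea, not the code inside `compute_block_takeup_for_entities`.

```python
def state_fips_from_block_geoid(block_geoid: str) -> str:
    """Return the 2-digit state FIPS prefix of a 15-digit block GEOID."""
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError(f"Expected a 15-digit block GEOID, got {block_geoid!r}")
    return block_geoid[:2]


# Made-up block in state 06 (California), county 037.
example_state = state_fips_from_block_geoid("060371234561000")
```

Keeping the GEOID as a string matters here, since converting to an integer would drop the leading zero for states such as 06.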

- Remove dead sim-based methods: _evaluate_constraints_entity_aware, _calculate_target_values, and calculate_spm_thresholds_for_cd
- Delete duplicate class methods _evaluate_constraints_from_values and _calculate_target_values_from_values; update call sites to use the existing standalone functions with variable_entity_map
- Fix count-vs-dollar classifier: replace substring heuristic in _get_uprating_info with endswith("_count"); use exact equality in validate_staging._classify_variable to prevent false positives
- Add optional precomputed_rates parameter to apply_block_takeup_to_arrays to skip redundant load_take_up_rate calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
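To see why a suffix test is safer than a substring test for the count-vs-dollar classifier, consider the sketch below. It assumes the old heuristic was something like `"count" in name`, which this conversation does not confirm, and `county_tax` is a made-up variable name chosen to trigger the false positive.

```python
def is_count_by_substring(variable: str) -> bool:
    """Assumed form of the old heuristic: any 'count' substring."""
    return "count" in variable


def is_count_by_suffix(variable: str) -> bool:
    """Stricter rule from the commit message: name must end in '_count'."""
    return variable.endswith("_count")


names = ["person_count", "household_count", "county_tax", "snap"]
by_substring = [n for n in names if is_count_by_substring(n)]
by_suffix = [n for n in names if is_count_by_suffix(n)]
```

The substring rule misclassifies the hypothetical `county_tax` as a count, while the suffix rule keeps only the two real count variables.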

juaristi22 reviewed Feb 18, 2026

policyengine_us_data/calibration/unified_calibration.py (outdated)
policyengine_us_data/calibration/unified_calibration.py (outdated)
policyengine_us_data/calibration/source_impute.py (outdated)
docs/calibration.md (outdated)
policyengine_us_data/calibration/source_impute.py

juaristi22 force-pushed the calibration-pipeline-improvements branch from 4c51b32 to 61523d8 on February 18, 2026 14:46

juaristi22 mentioned this pull request Feb 18, 2026

Category takeup rerandomization #540

Open

4 tasks

baogorek force-pushed the calibration-pipeline-improvements branch 2 times, most recently from 59b27a8 to 0a0f167 on February 19, 2026 23:07

juaristi22 marked this pull request as ready for review February 26, 2026 17:54

juaristi22 marked this pull request as draft February 26, 2026 17:54

baogorek mentioned this pull request Mar 3, 2026

Calibrate retirement contributions: targets, SS reconciliation, and QRF imputation #554

Merged

4 tasks

baogorek and others added 16 commits March 5, 2026 16:59

Ignore all calibration run outputs in storage/calibration/

f714dd0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add --lambda-l0 to Modal runner, fix load_dataset dict handling

baf798f

The Modal calibration runner was missing --lambda-l0 passthrough. Also fix KeyError: Ellipsis when load_dataset() returns dicts instead of h5py datasets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add --package-path support to Modal runner

eb9e75b

Upload a pre-built calibration package to Modal and run only the fitting phase, skipping HuggingFace download and matrix build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Create log directory before writing calibration log

4a0badf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add debug logging for CLI args and command in package path

6981c51

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Switch target config to finest-grain include (~18K targets)

7b53d29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix at-large district geoid mismatch (7 districts had 0 estimates)

7acc2f6

Add population-based initial weights for L0 calibration

d95efde

baogorek and others added 28 commits March 5, 2026 17:01

documentation

eb70f3a

flag

d45e8fe

changes to remote calibration runner

46874ae

Fix sanity_checks H5 key lookup for group/period structure

8f024e4

H5 files use variable_name/2024 (group → dataset), not flat keys. Use a _get() helper that resolves slash paths via f[path] instead of checking top-level f.keys(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix stage-h5s: add ::main entrypoint to modal run

64b0acd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

after county acknowledgement

bbd7c4a

Fix JSON serialization crash: __version__ resolved to module

35e55f8

`from policyengine_us_data import __version__` imports the submodule __version__.py rather than the string it defines. Changed to import from the module directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
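The import pitfall described in this commit can be reproduced with a throwaway package. The sketch assumes a layout where the package's `__init__.py` does not define `__version__` itself; `demo_version_pkg` is a made-up name created in a temporary directory.

```python
import json
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_version_pkg"
    pkg.mkdir()
    # __init__.py does not define __version__, so the name resolves
    # to the submodule demo_version_pkg/__version__.py instead.
    (pkg / "__init__.py").write_text("")
    (pkg / "__version__.py").write_text('__version__ = "1.2.3"\n')

    sys.path.insert(0, tmp)
    try:
        from demo_version_pkg import __version__ as wrong
        from demo_version_pkg.__version__ import __version__ as right
    finally:
        sys.path.remove(tmp)

# A module object cannot be serialized, which is how the crash surfaces.
try:
    json.dumps({"version": wrong})
    wrong_is_serializable = True
except TypeError:
    wrong_is_serializable = False

serialized = json.dumps({"version": right})
```

Importing the string from the submodule directly, as the commit does, gives a plain `str` that `json.dumps` accepts.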

Update upload_local_area_file docstring to list all subdirectories

e74998a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

late night work

da03be1

calibrated populaiton counts logging

b3de00f

saving column sums

33139b6

removing debugging logs

6e01d3c

unify build_h5 with wrappers

6b4e45e

removing wrappers

fb36a40

Add back save_geo_labels/load_geo_labels lost in rebase

52126f6

These functions were dropped during merge conflict resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add missing json import for save_geo_labels/load_geo_labels

337c543

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add volume reload before checking national H5 exists

5303f8d

Worker commits to volume but coordinator's view is stale. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

juaristi22 force-pushed the calibration-pipeline-improvements branch from a676c66 to c201eb7 on March 5, 2026 11:32

Regenerate uv.lock after rebase on main

5aad6e3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

baogorek wants to merge 73 commits into main from calibration-pipeline-improvements

baogorek commented Feb 17, 2026 • edited by juaristi22


3 participants
