Description
Motivation
Both the API v2 alpha and the policyengine package's PolicyEngineUSDataset require the entity-level pandas HDFStore format (one table per entity: person, household, tax_unit, spm_unit, family, marital_unit). Currently, -us-data publishes only the variable-centric h5py format (variable/year → array).
Converting between these formats via create_datasets() is extremely slow (~1hr+ per state) because it routes every variable through sim.calculate(), invoking the full simulation engine's dependency resolution for each variable × each year.
The UK avoids this: -uk-data publishes entity-level HDFStore directly, and policyengine-uk has extend_single_year_dataset() which uprates DataFrames via simple multiplication — no simulation engine needed.
Changes
1. HDFStore serialization in stacked_dataset_builder.py
After the existing h5py serialization, create_sparse_cd_stacked_dataset() now also:
- Splits combined_df into entity DataFrames: classifies each variable by entity using system.variables[var].entity.key, and deduplicates group entities by entity ID
- Builds an uprating manifest: records each variable's entity and uprating parameter path (from system.variables[var].uprating)
- Saves as HDFStore: uses the .hdfstore.h5 suffix, alongside the existing .h5 file
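The split and manifest steps above can be sketched roughly as follows. This is a minimal illustration, not the code in stacked_dataset_builder.py: `VARIABLE_INFO` is a hypothetical stand-in for `system.variables`, the helper names `split_by_entity` and `build_manifest` are invented for this example, and the uprating paths are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for system.variables:
# variable name -> (entity key, uprating parameter path or None).
VARIABLE_INFO = {
    "age": ("person", None),
    "employment_income": ("person", "hypothetical.uprating.wages"),
    "household_weight": ("household", None),
    "rent": ("household", "hypothetical.uprating.rent"),
}


def split_by_entity(combined_df, variable_info, group_entities):
    """Split a person-level frame into one table per entity."""
    # Classify each variable column by the entity it belongs to.
    columns_by_entity = {}
    for var, (entity, _uprating) in variable_info.items():
        if var in combined_df.columns:
            columns_by_entity.setdefault(entity, []).append(var)

    tables = {}
    # The person table keeps one row per person plus the membership IDs.
    id_cols = ["person_id"] + [f"{e}_id" for e in group_entities]
    person_cols = id_cols + columns_by_entity.get("person", [])
    tables["person"] = combined_df[person_cols].copy()

    # Group entities keep one row per entity ID.
    for entity in group_entities:
        id_col = f"{entity}_id"
        cols = [id_col] + columns_by_entity.get(entity, [])
        tables[entity] = (
            combined_df[cols]
            .drop_duplicates(subset=id_col)
            .reset_index(drop=True)
        )
    return tables


def build_manifest(variable_info):
    """Record each variable's entity and uprating path in one table."""
    rows = [
        {"variable": var, "entity": entity, "uprating": uprating}
        for var, (entity, uprating) in variable_info.items()
    ]
    return pd.DataFrame(rows, columns=["variable", "entity", "uprating"])


# Toy person-level frame: household values are repeated for each member.
combined_df = pd.DataFrame(
    {
        "person_id": [1, 2, 3],
        "household_id": [10, 10, 20],
        "age": [30, 5, 61],
        "employment_income": [50000.0, 0.0, 20000.0],
        "household_weight": [1.5, 1.5, 2.0],
        "rent": [1200.0, 1200.0, 800.0],
    }
)

tables = split_by_entity(combined_df, VARIABLE_INFO, ["household"])
manifest = build_manifest(VARIABLE_INFO)
```

In this toy input the household table collapses from three rows to two, which is the deduplication by entity ID that the first bullet describes.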
2. Upload pipeline in publish_local_area.py
HDFStore files are uploaded to dedicated subdirectories:
- states_hdfstore/
- districts_hdfstore/
- cities_hdfstore/
Both GCS and HuggingFace uploads are handled.
3. Comparison test
tests/test_format_comparison.py validates both formats contain identical data:
- Compares all ~183 variables between h5py and HDFStore
- Handles person-level (direct comparison) vs group-entity (unique value comparison)
- Tests manifest presence and entity table completeness
- Runnable as pytest or standalone CLI
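The two comparison modes in the list above could look something like the sketch below. It is an assumption about the approach, not the contents of tests/test_format_comparison.py: the function names are hypothetical, and "unique value comparison" is interpreted here as comparing the sorted sets of distinct values.

```python
import numpy as np


def _arrays_equal(a, b):
    """Compare two arrays, using a tolerance for numeric dtypes."""
    if a.shape != b.shape:
        return False
    if a.dtype.kind in "fiu" and b.dtype.kind in "fiu":
        return bool(np.allclose(a, b, equal_nan=True))
    return bool(np.array_equal(a, b))


def person_values_match(h5py_values, hdfstore_values):
    """Person-level variables: compare element by element."""
    return _arrays_equal(np.asarray(h5py_values), np.asarray(hdfstore_values))


def group_values_match(h5py_values, hdfstore_values):
    """Group-entity variables: compare the sorted unique values only,
    since the two formats may repeat a group value a different number
    of times (an assumption made for this sketch)."""
    a = np.unique(np.asarray(h5py_values))
    b = np.unique(np.asarray(hdfstore_values))
    return _arrays_equal(a, b)


same_people = person_values_match([30.0, 5.0, 61.0], [30.0, 5.0, 61.0])
same_groups = group_values_match([1200.0, 1200.0, 800.0], [800.0, 1200.0])
different_groups = group_values_match([1.0, 2.0], [1.0, 3.0])
```

A unique-value check is weaker than a row-aligned one: it would not notice values assigned to the wrong household, so it is best read as a sanity check on content rather than a proof of equality.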
pytest test_format_comparison.py --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
# or
python -m policyengine_us_data.tests.test_format_comparison --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
HDFStore structure
/person → DataFrame (all person-entity vars + entity membership IDs)
/household → DataFrame (deduplicated by household_id)
/tax_unit → DataFrame (deduplicated by tax_unit_id)
/spm_unit → DataFrame (deduplicated by spm_unit_id)
/family → DataFrame (deduplicated by family_id)
/marital_unit → DataFrame (deduplicated by marital_unit_id)
/_variable_metadata → DataFrame (variable, entity, uprating columns)
/_time_period → Series (base year)
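To make the layout above concrete, here is a small sketch that writes and reads back a file with the same key names using pandas' HDFStore (which needs the PyTables package installed). The data, the file name, and the use of an empty string for "no uprating" are illustrative assumptions; only two of the six entity tables are written for brevity.

```python
import os
import tempfile

import pandas as pd

# Toy entity tables; real files hold many more variables per entity.
person = pd.DataFrame(
    {"person_id": [1, 2, 3], "household_id": [10, 10, 20], "age": [30, 5, 61]}
)
household = pd.DataFrame(
    {"household_id": [10, 20], "household_weight": [1.5, 2.0]}
)
# Manifest with the three columns listed above. An empty string marks
# "no uprating" here, which is a choice made only for this example.
metadata = pd.DataFrame(
    {
        "variable": ["age", "household_weight"],
        "entity": ["person", "household"],
        "uprating": ["", ""],
    }
)

path = os.path.join(tempfile.mkdtemp(), "example.hdfstore.h5")

# Write one key per entity table, plus the manifest and the base year.
with pd.HDFStore(path, mode="w") as store:
    store.put("person", person)
    store.put("household", household)
    store.put("_variable_metadata", metadata)
    store.put("_time_period", pd.Series([2024]))

# Read the file back: keys are reported with a leading slash.
with pd.HDFStore(path, mode="r") as store:
    keys = set(store.keys())
    person_back = store["person"]
    base_year = int(store["_time_period"].iloc[0])
```

Because each entity is a separate key, a consumer can load only the tables it needs instead of reading every variable array.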
Future work
policyengine-us will add extend_single_year_dataset() to consume the HDFStore directly, enabling instant year projection without the simulation engine. The embedded uprating manifest makes each file self-describing and allows fallback when the package version doesn't exactly match the version used to build the dataset.
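As a rough idea of what manifest-driven uprating by multiplication might look like, consider the sketch below. It is not the planned extend_single_year_dataset() implementation: the function `uprate_table`, the `index_by_path` lookup, and the parameter path are all hypothetical, and the index values are made up.

```python
import pandas as pd


def uprate_table(table, manifest, entity, index_by_path, base_year, target_year):
    """Scale each uprated column by index[target_year] / index[base_year].

    index_by_path is a hypothetical lookup: uprating path -> {year: index}.
    Variables without an uprating path, or with an unknown path, are
    left unchanged.
    """
    out = table.copy()
    rows = manifest[(manifest["entity"] == entity) & manifest["uprating"].notna()]
    for var, path in zip(rows["variable"], rows["uprating"]):
        if var not in out.columns or path not in index_by_path:
            continue
        index = index_by_path[path]
        out[var] = out[var] * (index[target_year] / index[base_year])
    return out


person = pd.DataFrame(
    {"person_id": [1, 2], "age": [30, 61], "employment_income": [50000.0, 20000.0]}
)
manifest = pd.DataFrame(
    {
        "variable": ["age", "employment_income"],
        "entity": ["person", "person"],
        "uprating": [None, "hypothetical.uprating.wages"],
    }
)
# Made-up index values: a 4% rise between the base and target year.
index_by_path = {"hypothetical.uprating.wages": {2024: 100.0, 2025: 104.0}}

person_2025 = uprate_table(person, manifest, "person", index_by_path, 2024, 2025)
```

Because the manifest travels inside the file, a consumer could apply this kind of scaling without asking the tax-benefit system which variables are uprated.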
Branch
add-hdfstore-output