Add entity-level HDFStore output format alongside h5py #567

@anth-volk

Description

Motivation

The API v2 alpha and the policyengine package's PolicyEngineUSDataset require an entity-level pandas HDFStore format (one table per entity: person, household, tax_unit, spm_unit, family, marital_unit). Currently, policyengine-us-data publishes only a variable-centric h5py format (variable/year → array).

Converting between these formats via create_datasets() is extremely slow (~1hr+ per state) because it routes every variable through sim.calculate(), invoking the full simulation engine's dependency resolution for each variable × each year.

The UK avoids this: policyengine-uk-data publishes entity-level HDFStore directly, and policyengine-uk has extend_single_year_dataset(), which uprates DataFrames via simple multiplication, with no simulation engine needed.

Changes

1. HDFStore serialization in stacked_dataset_builder.py

After the existing h5py serialization, create_sparse_cd_stacked_dataset() now also:

  • Splits combined_df into entity DataFrames — classifies each variable by entity using system.variables[var].entity.key, deduplicates group entities by entity ID
  • Builds an uprating manifest — records each variable's entity and uprating parameter path (from system.variables[var].uprating)
  • Saves the result as an HDFStore file with a .hdfstore.h5 suffix alongside the existing .h5 file
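
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in stacked_dataset_builder.py: it assumes combined_df has one row per person with group-entity values repeated on each member row, and the variable-to-entity map, the uprating paths, and the helper names (split_by_entity, build_manifest, save_hdfstore) are all hypothetical stand-ins for what the real code derives from system.variables.

```python
import pandas as pd

# Hypothetical variable -> entity map; the real code would read
# system.variables[var].entity.key from the tax-benefit system.
VARIABLE_ENTITY = {
    "age": "person",
    "employment_income": "person",
    "household_weight": "household",
    "state_fips": "household",
}
# Hypothetical uprating parameter paths (variables not listed are not uprated).
VARIABLE_UPRATING = {
    "employment_income": "hypothetical.uprating.employment_income",
}


def split_by_entity(combined_df, variable_entity):
    """Return {entity: DataFrame} with one row per entity instance."""
    id_cols = [c for c in combined_df.columns if c.endswith("_id")]
    tables = {}
    for entity in sorted(set(variable_entity.values())):
        cols = [
            v for v, e in variable_entity.items()
            if e == entity and v in combined_df.columns and v not in id_cols
        ]
        if entity == "person":
            # The person table keeps every membership ID column.
            tables[entity] = combined_df[id_cols + cols].reset_index(drop=True)
        else:
            # Group entities: keep one row per entity ID.
            id_col = f"{entity}_id"
            tables[entity] = (
                combined_df[[id_col] + cols]
                .drop_duplicates(subset=id_col)
                .reset_index(drop=True)
            )
    return tables


def build_manifest(variable_entity, variable_uprating):
    """Record each variable's entity and uprating path ('' if none)."""
    return pd.DataFrame({
        "variable": list(variable_entity),
        "entity": list(variable_entity.values()),
        "uprating": [variable_uprating.get(v, "") for v in variable_entity],
    })


def save_hdfstore(path, tables, manifest, year):
    """Write the tables to an HDFStore file (needs the PyTables package)."""
    with pd.HDFStore(path, mode="w") as store:
        for entity, df in tables.items():
            store.put(entity, df)
        store.put("_variable_metadata", manifest)
        store.put("_time_period", pd.Series([year]))


# Toy input: three people in two households.
combined_df = pd.DataFrame({
    "person_id": [1, 2, 3],
    "household_id": [10, 10, 20],
    "age": [30, 5, 61],
    "employment_income": [40000.0, 0.0, 12000.0],
    "household_weight": [1.5, 1.5, 2.0],
    "state_fips": [32, 32, 32],
})
tables = split_by_entity(combined_df, VARIABLE_ENTITY)
manifest = build_manifest(VARIABLE_ENTITY, VARIABLE_UPRATING)
```

Calling save_hdfstore("NV.hdfstore.h5", tables, manifest, year) would then write the file; the year is whatever base period the dataset was built for.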

2. Upload pipeline in publish_local_area.py

HDFStore files are uploaded to dedicated subdirectories:

  • states_hdfstore/
  • districts_hdfstore/
  • cities_hdfstore/

Both GCS and HuggingFace uploads are handled.

3. Comparison test

tests/test_format_comparison.py validates both formats contain identical data:

  • Compares all ~183 variables between h5py and HDFStore
  • Handles person-level (direct comparison) vs group-entity (unique value comparison)
  • Tests manifest presence and entity table completeness
  • Runnable as pytest or standalone CLI

pytest test_format_comparison.py --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
# or
python -m policyengine_us_data.tests.test_format_comparison --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
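
The two comparison modes listed above might look something like the sketch below. It is not the test file itself: the function names are hypothetical, and it assumes numeric variables (string or enum variables would need an exact comparison such as np.array_equal instead of np.allclose).

```python
import numpy as np
import pandas as pd


def person_values_match(h5py_array, person_df, variable):
    """Person-level: compare element by element, in row order."""
    expected = np.asarray(h5py_array, dtype=float)
    actual = person_df[variable].to_numpy(dtype=float)
    return expected.shape == actual.shape and np.allclose(expected, actual)


def group_values_match(h5py_array, entity_df, variable):
    """Group-entity: compare the sorted sets of unique values."""
    expected = np.unique(np.asarray(h5py_array, dtype=float))
    actual = np.unique(entity_df[variable].to_numpy(dtype=float))
    return expected.shape == actual.shape and np.allclose(expected, actual)


# Toy data standing in for arrays read from the two files.
person_df = pd.DataFrame({"age": [30, 5, 61]})
household_df = pd.DataFrame({"household_weight": [2.0, 1.5]})

person_ok = person_values_match([30, 5, 61], person_df, "age")
group_ok = group_values_match([1.5, 1.5, 2.0], household_df, "household_weight")
group_bad = group_values_match([1.5, 3.0], household_df, "household_weight")
```

Comparing unique values for group entities sidesteps differences in row order and row count between the two formats, at the cost of not checking which entity holds which value.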

HDFStore structure

/person          → DataFrame (all person-entity vars + entity membership IDs)
/household       → DataFrame (deduplicated by household_id)
/tax_unit        → DataFrame (deduplicated by tax_unit_id)
/spm_unit        → DataFrame (deduplicated by spm_unit_id)
/family          → DataFrame (deduplicated by family_id)
/marital_unit    → DataFrame (deduplicated by marital_unit_id)
/_variable_metadata → DataFrame (variable, entity, uprating columns)
/_time_period    → Series (base year)
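
A consumer could read this layout back along the following lines. This is a sketch only: read_entity_tables is a hypothetical helper, and it accepts any mapping-like object so that it works both with an open pd.HDFStore and with a plain dict, which is what the toy example uses.

```python
import pandas as pd

ENTITY_KEYS = [
    "person", "household", "tax_unit", "spm_unit", "family", "marital_unit",
]


def read_entity_tables(store):
    """Return (entity tables, variable metadata, base year) from a store."""
    tables = {
        key: store[f"/{key}"] for key in ENTITY_KEYS if f"/{key}" in store
    }
    manifest = store["/_variable_metadata"]
    # /_time_period is a one-element Series holding the base year.
    base_year = int(store["/_time_period"].iloc[0])
    return tables, manifest, base_year


# Toy stand-in for an open store, with an arbitrary example year.
fake_store = {
    "/person": pd.DataFrame({"person_id": [1, 2], "household_id": [10, 10]}),
    "/household": pd.DataFrame({"household_id": [10]}),
    "/_variable_metadata": pd.DataFrame(
        {"variable": ["age"], "entity": ["person"], "uprating": [""]}
    ),
    "/_time_period": pd.Series([2024]),
}
tables, manifest, base_year = read_entity_tables(fake_store)
```

With a real file, the same function would be called inside `with pd.HDFStore("NV.hdfstore.h5", mode="r") as store:`, which requires the PyTables package.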

Future work

policyengine-us will add extend_single_year_dataset() to consume the HDFStore directly, enabling instant year projection without the simulation engine. The embedded uprating manifest makes each file self-describing and allows fallback when the package version doesn't exactly match the version used to build the dataset.
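
To illustrate how the embedded manifest could drive year projection, here is a rough sketch. It is not the planned extend_single_year_dataset() implementation: uprate_tables and growth_factors are hypothetical, and growth_factors is assumed to map each uprating parameter path to the ratio of the target-year index to the base-year index.

```python
import pandas as pd


def uprate_tables(tables, manifest, growth_factors):
    """Return copies of the entity tables with uprated variables scaled."""
    projected = {entity: df.copy() for entity, df in tables.items()}
    for row in manifest.itertuples(index=False):
        factor = growth_factors.get(row.uprating)
        if factor is None:
            continue  # No uprating path, or no factor known for it.
        table = projected[row.entity]
        table[row.variable] = table[row.variable] * factor
    return projected


# Toy inputs; the uprating path is a made-up placeholder.
tables = {
    "person": pd.DataFrame({"age": [30, 61], "employment_income": [100.0, 200.0]}),
}
manifest = pd.DataFrame({
    "variable": ["age", "employment_income"],
    "entity": ["person", "person"],
    "uprating": ["", "hypothetical.uprating.employment_income"],
})
growth_factors = {"hypothetical.uprating.employment_income": 1.05}

projected = uprate_tables(tables, manifest, growth_factors)
```

Because the manifest travels with the file, a consumer only needs the growth factors for the target year, not the variable definitions from the exact package version that built the dataset.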

Branch

add-hdfstore-output
