Switch over to UDAL by tomjemmett · Pull Request #237 · The-Strategy-Unit/nhp_data

tomjemmett · 2026-03-17T11:08:10Z

No description provided.

I was getting Spark errors using toPandas().to_parquet(), so this switches to using Spark to write the parquet files. The save_parquet() method cleans up the saved parquet output: it renames the file to something predictable and removes the empty text files Spark creates for the write transaction.
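The clean-up step described above can be sketched roughly as follows. This is not the PR's save_parquet() implementation, which is not shown here; the function name `tidy_single_parquet`, the file names, and the assumption that Spark wrote exactly one `part-*.parquet` file into a temporary directory are all hypothetical. It uses plain local paths, so a Databricks volume or DBFS location may need different file utilities.

```python
import shutil
import tempfile
from pathlib import Path


def tidy_single_parquet(spark_output_dir: Path, target_file: Path) -> Path:
    """Promote the single part file Spark wrote to a predictable file name.

    Hypothetical helper: assumes the directory holds exactly one
    part-*.parquet file plus Spark's empty marker files.
    """
    part_files = sorted(spark_output_dir.glob("part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}"
        )
    target_file.parent.mkdir(parents=True, exist_ok=True)
    # Move the data file to its final, predictable name ...
    shutil.move(str(part_files[0]), str(target_file))
    # ... then drop the directory together with the leftover marker files.
    shutil.rmtree(spark_output_dir)
    return target_file


# Simulate the layout a single-partition Spark write leaves behind.
work = Path(tempfile.mkdtemp())
out_dir = work / "rates.parquet.tmp"
out_dir.mkdir()
(out_dir / "part-00000-abc.snappy.parquet").write_bytes(b"PAR1")
(out_dir / "_SUCCESS").touch()

final_path = tidy_single_parquet(out_dir, work / "rates.parquet")
```

Raising when the part-file count is not exactly one is a deliberate choice in this sketch: it surfaces a write that was not reduced to a single partition instead of silently keeping only part of the data.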

Copilot

Pull request overview

This PR switches the NHP data pipelines and configuration over to the UDAL environment by introducing environment-specific table name mappings and updating Databricks job bundle/workflow definitions accordingly.

Changes:

Replace the single table_names.py module with a table_names/ package containing a TableNames dataclass and environment-specific configurations (MLCSU vs UDAL), defaulting selection to UDAL.
Update data generation/extraction code to align with UDAL schemas and improve robustness (typing/assertions, casting, filtering invalid values).
Update Databricks asset bundle workflows/clusters for UDAL (workspace host, cluster settings, task ordering, new entry point wiring).
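As an illustration of the first change, a dataclass with one instance per environment might look like the sketch below. Only the `TableNames` name, the MLCSU/UDAL split, and the UDAL default come from the review summary; the field names, table identifiers, and the `get_table_names` helper are placeholders assumed for the example, not the repository's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableNames:
    """Names of the source tables used by the pipelines (hypothetical fields)."""

    hes_apc: str
    hes_opa: str
    ecds: str


# Placeholder identifiers: the real catalog/schema names are not shown in the PR page.
mlcsu = TableNames(
    hes_apc="mlcsu_catalog.hes.apc",
    hes_opa="mlcsu_catalog.hes.opa",
    ecds="mlcsu_catalog.ecds.ecds",
)
udal = TableNames(
    hes_apc="udal_catalog.hes.apc",
    hes_opa="udal_catalog.hes.opa",
    ecds="udal_catalog.ecds.ecds",
)

_ENVIRONMENTS = {"mlcsu": mlcsu, "udal": udal}


def get_table_names(environment: str = "udal") -> TableNames:
    """Return the mapping for an environment, defaulting to UDAL."""
    try:
        return _ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


table_names = get_table_names()
```

Freezing the dataclass keeps each mapping read-only, so pipeline code can share one instance without the risk of a table name being reassigned at runtime.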

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.

File	Description
src/nhp/data/table_names/udal.py	Adds UDAL-specific table and file path mappings.
src/nhp/data/table_names/mlcsu.py	Extracts MLCSU table/file mappings into a dedicated config object.
src/nhp/data/table_names/table_names.py	Introduces `TableNames` dataclass used across the codebase.
src/nhp/data/table_names/__init__.py	Selects the active table mapping (currently hard-coded to UDAL).
src/nhp/data/table_names.py	Removes the legacy monolithic table name selector/module.
src/nhp/data/reference/trust_types.py	Hardens link parsing to avoid non-string href values.
src/nhp/data/reference/provider_catchments.py	Switches to using the `hes_apc` dataset abstraction and filters acute providers.
src/nhp/data/reference/population_by_lsoa21.py	Refactors population-by-LSOA generation to write directly to a Spark table.
src/nhp/data/reference/population_by_imd_decile.py	Filters invalid “NA” population values before casting/aggregation.
src/nhp/data/reference/ods_trusts.py	Improves typing and XML parsing safety for organisation records.
src/nhp/data/reference/icb_catchments.py	Removes unnecessary type-ignore and keeps request paging logic.
src/nhp/data/reference/day_procedures.py	Minor refactor of lambda formatting around binomial tests.
src/nhp/data/raw_data/mitigators/ip/efficiency/excess_beddays.py	Switches CSV ingest to pandas→Spark DataFrame creation.
src/nhp/data/raw_data/mitigators/ip/activity_avoidance/smoking_related_admissions.py	Switches SAF CSV ingest to pandas→Spark DataFrame creation.
src/nhp/data/raw_data/mitigators/ip/activity_avoidance/obesity_related_admissions.py	Switches OAF CSV ingest to pandas→Spark DataFrame creation.
src/nhp/data/raw_data/mitigators/ip/activity_avoidance/evidence_based_interventions_msk.py	Formatting-only adjustment to a diagnosis predicate.
src/nhp/data/raw_data/mitigators/ip/activity_avoidance/alcohol.py	Replaces `sparkContext` JSON ingest with `createDataFrame` from dicts.
src/nhp/data/raw_data/inpatients.py	Adds maternity episode typing enrichment and uses it in spell selection.
src/nhp/data/raw_data/ecds.py	Normalises token/person identifiers and casts key columns to string.
src/nhp/data/population_projections/get_ons_files_2022.py	Improves aria-label lambda formatting and asserts href type for NPP link.
src/nhp/data/nhp_datasets/local_authorities.py	Removes unused/obsolete local authority successor logic.
src/nhp/data/model_data/generate_synthetic_data.py	Strips heavy attrs before writing parquet to reduce output metadata.
src/nhp/data/inputs_data/save_parquet.py	Adds Spark-based single-parquet-file writer utility for outputs.
src/nhp/data/inputs_data/rates.py	Orders outputs deterministically and uses new parquet saver.
src/nhp/data/inputs_data/procedures.py	Orders outputs deterministically and uses new parquet saver.
src/nhp/data/inputs_data/op/rates.py	Filters out non-positive denominators after follow-up adjustment.
src/nhp/data/inputs_data/op/expat_repat.py	Filters out non-positive aggregated counts before persisting.
src/nhp/data/inputs_data/ip/rates.py	Filters out non-positive denominators after population join.
src/nhp/data/inputs_data/inequalities.py	Filters out non-positive populations, fixes log message, uses parquet saver.
src/nhp/data/inputs_data/expat_repat.py	Adds deterministic ordering and uses parquet saver.
src/nhp/data/inputs_data/diagnoses.py	Orders outputs deterministically, updates docstring, uses parquet saver.
src/nhp/data/inputs_data/baseline.py	Orders outputs deterministically and uses parquet saver.
src/nhp/data/inputs_data/age_sex.py	Refactors aggregation flow and uses parquet saver for output.
src/nhp/data/default/opa.py	Switches default view creation to use environment-specific table names.
src/nhp/data/default/ecds.py	Switches default view creation to use environment-specific table names.
src/nhp/data/default/apc.py	Switches default view creation to use environment-specific table names.
pyproject.toml	Adds a console entry point for `reference-icb_catchments`.
databricks_workflows/nhp_data.yaml	Removes webhook notifications and notification settings from job definition.
databricks_workflows/nhp_data-reference_data.yaml	Reorders/extends reference tasks, unifies cluster key, updates cluster spec for UDAL.
databricks_workflows/nhp_data-population_projections.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-outpatients.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-inputs-provider.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-inputs-lad23cd.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-inpatients.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-extract_nhp_for_containers.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks_workflows/nhp_data-ecds.yaml	Unifies cluster key and updates cluster spec for UDAL.
databricks.yml	Switches bundle workspace host(s) to UDAL and simplifies prod target config.

Comments suppressed due to low confidence (1)

databricks.yml:33

The prod bundle target no longer specifies run_as, root_path, or explicit permissions. This changes how/where production resources are deployed and who can manage them (often defaulting to the deploying user). Please confirm this is intentional for the UDAL move; otherwise restore the production identity/path/permissions settings needed for automated/protected prod deployments.

  prod:
    mode: production
    workspace:
      host: https://adb-6450443583208388.8.azuredatabricks.net

yiwen-h

Went through on call 18/03/2026 - all good - just check population_by_lsoa and ping me when done!

tomjemmett added 24 commits February 17, 2026 12:51

adds udal table name mappings

504193f

switch to use udal table names

2e57920

switch to use udal compute

f6cca81

refactor create_pop_by_lsoa21 to prevent cache size exceeding limits in shared clusters

01f9a50

use table_names in creation of views

57ac625

rederives maternity_episode_type column

fdd85dd

fixes alcohol mitigator

b84e235

sparkContext is deprecated, createDataFrame works fine instead

removes circular dependency on table which is created later

5827332

fixes issue with loading reference csvs

8f35e30

fix table names

badd8e0

fix excess bed days mitigator

6ae78df

fix population by imd

0f3e12f

fixes for udal ecds table columns

74beac5

orders inputs data before writing

7d124fb

filters out problematic rows (denominator=0)

b48bdb7

ensures denominator is greater than 0

7d77160

fix type issues

73ddf05

fix lint issues

694f151

changes the prod target

b2a61c2

adds icb catchments creation to reference job

08aa61e

fixes issue with pandas to_parquet

cb8a6c3

unserializable data is stored in the dataframe attributes; we can simply remove these items and to_parquet works. this is a known issue, [spark-54068], which should be fixed in later releases of pyspark
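A minimal sketch of that workaround, assuming the problematic entries live in the pandas `DataFrame.attrs` dictionary: the helper name `without_attrs` and the example data are hypothetical, and this version clears every attribute, whereas the PR may remove only specific items.

```python
import pandas as pd


def without_attrs(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with its attrs metadata emptied (hypothetical helper)."""
    cleaned = df.copy()
    # attrs is a plain dict of user metadata; replacing it leaves the data untouched.
    cleaned.attrs = {}
    return cleaned


df = pd.DataFrame({"age": [1, 2], "n": [10, 20]})
df.attrs["plan"] = object()  # stand-in for metadata that cannot be serialized

cleaned = without_attrs(df)
# cleaned.to_parquet("synthetic.parquet")  # needs a parquet engine such as pyarrow
```

Working on a copy leaves the original frame's attributes intact in case other code still relies on them.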

fix typing issues

392bb5a

fix formatting issues

c0b26ca

tomjemmett marked this pull request as ready for review March 17, 2026 15:56

tomjemmett requested a review from a team as a code owner March 17, 2026 15:56

Copilot AI review requested due to automatic review settings March 17, 2026 15:56

Copilot started reviewing on behalf of tomjemmett March 17, 2026 15:56

Copilot AI reviewed Mar 17, 2026

Comment thread src/nhp/data/inputs_data/save_parquet.py Outdated

Comment thread src/nhp/data/reference/population_by_lsoa21.py Outdated

Comment thread src/nhp/data/raw_data/inpatients.py Outdated

Apply suggestions from code review

c4b36bf

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

yiwen-h requested changes Mar 18, 2026

move drop table to before first for loop

fcdf7bc

tomjemmett force-pushed the udal branch from 514f5be to b4181f2 Compare March 18, 2026 15:04

tomjemmett added 2 commits March 18, 2026 15:46

configure the prod environment correctly

bea02c2

adds step to extract the data for use in the model docker containers and inputs app

86b056f

tomjemmett force-pushed the udal branch from 6f9edfd to 86b056f Compare March 18, 2026 15:47

tomjemmett requested a review from yiwen-h March 18, 2026 15:55

yiwen-h approved these changes Mar 18, 2026

tomjemmett merged commit 7c9433f into main Mar 18, 2026
3 checks passed

tomjemmett deleted the udal branch March 18, 2026 16:11

tomjemmett merged 28 commits into main from udal

3 participants
