Switch over to UDAL #237

Merged
tomjemmett merged 28 commits into main from udal
Mar 18, 2026

@tomjemmett
Member

No description provided.

sparkContext is deprecated, createDataFrame works fine instead
was getting spark errors using toPandas().to_parquet(), so instead switching to using spark to write the parquet files.

the save_parquet() method cleans up the saved parquet, renaming the file to be something predictable, and removing the empty text files spark creates for the write transaction.
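A rough sketch of the clean-up step described above, not the actual `save_parquet()` implementation. It assumes Spark has already written a single part file into a directory (for example via `df.coalesce(1).write.mode("overwrite").parquet(...)`); the function name `tidy_spark_output` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def tidy_spark_output(spark_dir: Path, target: Path) -> Path:
    """Keep the single parquet part file from `spark_dir`, saved as `target`."""
    # Spark names its data files part-<n>-<uuid>...parquet; marker files such
    # as _SUCCESS do not match this pattern.
    part_files = sorted(spark_dir.glob("part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected one part file in {spark_dir}, found {len(part_files)}"
        )
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(part_files[0]), str(target))
    # Everything left in the directory is write-transaction residue.
    shutil.rmtree(spark_dir)
    return target


# Simulate the directory Spark would leave behind after something like
#   df.coalesce(1).write.mode("overwrite").parquet(str(out_dir))
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "rates.parquet.tmp"
    out_dir.mkdir()
    (out_dir / "part-00000-abc.snappy.parquet").write_bytes(b"PAR1")
    (out_dir / "_SUCCESS").touch()
    final = tidy_spark_output(out_dir, Path(tmp) / "rates.parquet")
    print(final.name, out_dir.exists())
```

Raising when there is not exactly one part file is a design choice in this sketch: it fails loudly if the DataFrame was not coalesced to a single partition, rather than silently picking one file.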
unserializable data is stored in the dataframe attributes; we can simply remove these items and to_parquet works

known issue, [spark-54068] which should be fixed in later releases of pyspark
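One way the attrs workaround could look, as a sketch only. It assumes the failure comes from attribute values that cannot be JSON-encoded; the helper name `json_safe_attrs` is hypothetical and the actual commit may simply drop specific keys instead.

```python
import json
from typing import Any


def json_safe_attrs(attrs: dict[Any, Any]) -> dict[Any, Any]:
    """Return a copy of `attrs` without entries that json.dumps rejects."""
    safe = {}
    for key, value in attrs.items():
        try:
            json.dumps({key: value})
        except (TypeError, ValueError):
            # Not serialisable (or circular): leave it out of the result.
            continue
        safe[key] = value
    return safe


# `object()` stands in for whatever unserialisable item ends up in the attrs.
print(json_safe_attrs({"source": "synthetic", "metrics": object()}))
```

With a pandas DataFrame this would be applied as `df.attrs = json_safe_attrs(df.attrs)` before calling `df.to_parquet(...)`.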
@tomjemmett tomjemmett marked this pull request as ready for review March 17, 2026 15:56
@tomjemmett tomjemmett requested a review from a team as a code owner March 17, 2026 15:56
Copilot AI review requested due to automatic review settings March 17, 2026 15:56
Contributor

Copilot AI left a comment

Pull request overview

This PR switches the NHP data pipelines and configuration over to the UDAL environment by introducing environment-specific table name mappings and updating Databricks job bundle/workflow definitions accordingly.

Changes:

  • Replace the single table_names.py module with a table_names/ package containing a TableNames dataclass and environment-specific configurations (MLCSU vs UDAL), defaulting selection to UDAL.
  • Update data generation/extraction code to align with UDAL schemas and improve robustness (typing/assertions, casting, filtering invalid values).
  • Update Databricks asset bundle workflows/clusters for UDAL (workspace host, cluster settings, task ordering, new entry point wiring).
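To illustrate the first change, here is a minimal sketch of a dataclass holding environment-specific table names. The field names, table names, and the `get_table_names` helper are all hypothetical; the summary only says that a `TableNames` dataclass exists and that the package currently hard-codes the UDAL mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableNames:
    # Hypothetical fields; the real dataclass will have its own set.
    hes_apc: str
    hes_opa: str
    reference_data_path: str


# Placeholder values, not the real catalog, schema, or path names.
mlcsu = TableNames(
    hes_apc="mlcsu_catalog.hes.apc",
    hes_opa="mlcsu_catalog.hes.opa",
    reference_data_path="/Volumes/mlcsu/reference",
)
udal = TableNames(
    hes_apc="udal_catalog.hes.apc",
    hes_opa="udal_catalog.hes.opa",
    reference_data_path="/Volumes/udal/reference",
)

_ENVIRONMENTS = {"mlcsu": mlcsu, "udal": udal}


def get_table_names(environment: str = "udal") -> TableNames:
    """Look up the mapping for an environment, defaulting to UDAL."""
    try:
        return _ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


table_names = get_table_names()
print(table_names.hes_apc)
```

Freezing the dataclass keeps a mapping from being mutated after import, so every module that reads `table_names` sees the same values.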

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/nhp/data/table_names/udal.py` | Adds UDAL-specific table and file path mappings. |
| `src/nhp/data/table_names/mlcsu.py` | Extracts MLCSU table/file mappings into a dedicated config object. |
| `src/nhp/data/table_names/table_names.py` | Introduces TableNames dataclass used across the codebase. |
| `src/nhp/data/table_names/__init__.py` | Selects the active table mapping (currently hard-coded to UDAL). |
| `src/nhp/data/table_names.py` | Removes the legacy monolithic table name selector/module. |
| `src/nhp/data/reference/trust_types.py` | Hardens link parsing to avoid non-string href values. |
| `src/nhp/data/reference/provider_catchments.py` | Switches to using the hes_apc dataset abstraction and filters acute providers. |
| `src/nhp/data/reference/population_by_lsoa21.py` | Refactors population-by-LSOA generation to write directly to a Spark table. |
| `src/nhp/data/reference/population_by_imd_decile.py` | Filters invalid “NA” population values before casting/aggregation. |
| `src/nhp/data/reference/ods_trusts.py` | Improves typing and XML parsing safety for organisation records. |
| `src/nhp/data/reference/icb_catchments.py` | Removes unnecessary type-ignore and keeps request paging logic. |
| `src/nhp/data/reference/day_procedures.py` | Minor refactor of lambda formatting around binomial tests. |
| `src/nhp/data/raw_data/mitigators/ip/efficiency/excess_beddays.py` | Switches CSV ingest to pandas→Spark DataFrame creation. |
| `src/nhp/data/raw_data/mitigators/ip/activity_avoidance/smoking_related_admissions.py` | Switches SAF CSV ingest to pandas→Spark DataFrame creation. |
| `src/nhp/data/raw_data/mitigators/ip/activity_avoidance/obesity_related_admissions.py` | Switches OAF CSV ingest to pandas→Spark DataFrame creation. |
| `src/nhp/data/raw_data/mitigators/ip/activity_avoidance/evidence_based_interventions_msk.py` | Formatting-only adjustment to a diagnosis predicate. |
| `src/nhp/data/raw_data/mitigators/ip/activity_avoidance/alcohol.py` | Replaces sparkContext JSON ingest with createDataFrame from dicts. |
| `src/nhp/data/raw_data/inpatients.py` | Adds maternity episode typing enrichment and uses it in spell selection. |
| `src/nhp/data/raw_data/ecds.py` | Normalises token/person identifiers and casts key columns to string. |
| `src/nhp/data/population_projections/get_ons_files_2022.py` | Improves aria-label lambda formatting and asserts href type for NPP link. |
| `src/nhp/data/nhp_datasets/local_authorities.py` | Removes unused/obsolete local authority successor logic. |
| `src/nhp/data/model_data/generate_synthetic_data.py` | Strips heavy attrs before writing parquet to reduce output metadata. |
| `src/nhp/data/inputs_data/save_parquet.py` | Adds Spark-based single-parquet-file writer utility for outputs. |
| `src/nhp/data/inputs_data/rates.py` | Orders outputs deterministically and uses new parquet saver. |
| `src/nhp/data/inputs_data/procedures.py` | Orders outputs deterministically and uses new parquet saver. |
| `src/nhp/data/inputs_data/op/rates.py` | Filters out non-positive denominators after follow-up adjustment. |
| `src/nhp/data/inputs_data/op/expat_repat.py` | Filters out non-positive aggregated counts before persisting. |
| `src/nhp/data/inputs_data/ip/rates.py` | Filters out non-positive denominators after population join. |
| `src/nhp/data/inputs_data/inequalities.py` | Filters out non-positive populations, fixes log message, uses parquet saver. |
| `src/nhp/data/inputs_data/expat_repat.py` | Adds deterministic ordering and uses parquet saver. |
| `src/nhp/data/inputs_data/diagnoses.py` | Orders outputs deterministically, updates docstring, uses parquet saver. |
| `src/nhp/data/inputs_data/baseline.py` | Orders outputs deterministically and uses parquet saver. |
| `src/nhp/data/inputs_data/age_sex.py` | Refactors aggregation flow and uses parquet saver for output. |
| `src/nhp/data/default/opa.py` | Switches default view creation to use environment-specific table names. |
| `src/nhp/data/default/ecds.py` | Switches default view creation to use environment-specific table names. |
| `src/nhp/data/default/apc.py` | Switches default view creation to use environment-specific table names. |
| `pyproject.toml` | Adds a console entry point for reference-icb_catchments. |
| `databricks_workflows/nhp_data.yaml` | Removes webhook notifications and notification settings from job definition. |
| `databricks_workflows/nhp_data-reference_data.yaml` | Reorders/extends reference tasks, unifies cluster key, updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-population_projections.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-outpatients.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-inputs-provider.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-inputs-lad23cd.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-inpatients.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-extract_nhp_for_containers.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks_workflows/nhp_data-ecds.yaml` | Unifies cluster key and updates cluster spec for UDAL. |
| `databricks.yml` | Switches bundle workspace host(s) to UDAL and simplifies prod target config. |

Comments suppressed due to low confidence (1)

databricks.yml:33

  • The prod bundle target no longer specifies run_as, root_path, or explicit permissions. This changes how/where production resources are deployed and who can manage them (often defaulting to the deploying user). Please confirm this is intentional for the UDAL move; otherwise restore the production identity/path/permissions settings needed for automated/protected prod deployments.
  prod:
    mode: production
    workspace:
      host: https://adb-6450443583208388.8.azuredatabricks.net


Comment thread src/nhp/data/inputs_data/save_parquet.py Outdated
Comment thread src/nhp/data/reference/population_by_lsoa21.py Outdated
Comment thread src/nhp/data/raw_data/inpatients.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Member

@yiwen-h yiwen-h left a comment

Went through on call 18/03/2026 - all good - just check population_by_lsoa and ping me when done!

@tomjemmett tomjemmett merged commit 7c9433f into main Mar 18, 2026
3 checks passed
@tomjemmett tomjemmett deleted the udal branch March 18, 2026 16:11
3 participants