sparkContext is deprecated; createDataFrame works fine instead.
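For illustration, a minimal sketch of the change this commit describes, assuming a JSON file of records is being ingested (the file name here is hypothetical; the per-file summary below notes this change was made in `alcohol.py`):

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open("alcohol_fractions.json") as f:  # hypothetical input file
    records = json.load(f)  # expected to be a list of dicts

# Old approach (relies on sparkContext, which is unavailable on shared clusters):
#   rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
#   df = spark.read.json(rdd)

# New approach: build the DataFrame directly from the list of dicts.
df = spark.createDataFrame(records)
```

Note that `createDataFrame` infers the schema from the dicts; passing an explicit schema is safer for production ingest.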
We were getting Spark errors using toPandas().to_parquet(), so we switched to using Spark to write the parquet files. The save_parquet() method cleans up the saved parquet, renaming the file to something predictable and removing the empty marker files Spark creates for the write transaction.
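A minimal sketch of what a save_parquet() helper like this might look like, assuming it runs on Databricks where the `dbutils` filesystem API is available as a global; the signature, paths, and file handling here are illustrative, not the actual implementation:

```python
from pyspark.sql import DataFrame


def save_parquet(df: DataFrame, path: str, filename: str) -> None:
    """Write df as a single, predictably named parquet file under path."""
    tmp_path = f"{path}/_tmp_{filename}"

    # coalesce(1) forces Spark to emit a single part-*.parquet file
    df.coalesce(1).write.mode("overwrite").parquet(tmp_path)

    # locate the one part file Spark produced in the temporary directory
    part_file = next(
        f.path for f in dbutils.fs.ls(tmp_path) if f.name.startswith("part-")
    )

    # rename it to the predictable target name, then remove the temporary
    # directory along with the empty transaction files (_SUCCESS, etc.)
    dbutils.fs.mv(part_file, f"{path}/{filename}")
    dbutils.fs.rm(tmp_path, True)
```

Usage would then be something like `save_parquet(results_df, "/mnt/outputs/rates", "rates.parquet")`.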
Unserializable data is stored in the DataFrame attributes; we can simply remove these items and to_parquet works. This is a known issue ([SPARK-54068]) which should be fixed in later releases of PySpark.
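A hedged sketch of that workaround, assuming the offending metadata lives in pandas' `DataFrame.attrs` (the helper name is hypothetical):

```python
import pandas as pd


def to_parquet_without_attrs(df: pd.DataFrame, path: str) -> None:
    """Drop DataFrame.attrs metadata before writing parquet."""
    df = df.copy()  # avoid mutating the caller's frame
    df.attrs = {}   # remove the unserializable items stored in attrs
    df.to_parquet(path, index=False)
```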
Copilot (Contributor) left a review:
Pull request overview
This PR switches the NHP data pipelines and configuration over to the UDAL environment by introducing environment-specific table name mappings and updating Databricks job bundle/workflow definitions accordingly.
Changes:
- Replace the single `table_names.py` module with a `table_names/` package containing a `TableNames` dataclass and environment-specific configurations (MLCSU vs UDAL), defaulting selection to UDAL.
- Update data generation/extraction code to align with UDAL schemas and improve robustness (typing/assertions, casting, filtering invalid values).
- Update Databricks asset bundle workflows/clusters for UDAL (workspace host, cluster settings, task ordering, new entry point wiring).
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| src/nhp/data/table_names/udal.py | Adds UDAL-specific table and file path mappings. |
| src/nhp/data/table_names/mlcsu.py | Extracts MLCSU table/file mappings into a dedicated config object. |
| src/nhp/data/table_names/table_names.py | Introduces TableNames dataclass used across the codebase. |
| src/nhp/data/table_names/__init__.py | Selects the active table mapping (currently hard-coded to UDAL; see the sketch after this table). |
| src/nhp/data/table_names.py | Removes the legacy monolithic table name selector/module. |
| src/nhp/data/reference/trust_types.py | Hardens link parsing to avoid non-string href values. |
| src/nhp/data/reference/provider_catchments.py | Switches to using the hes_apc dataset abstraction and filters acute providers. |
| src/nhp/data/reference/population_by_lsoa21.py | Refactors population-by-LSOA generation to write directly to a Spark table. |
| src/nhp/data/reference/population_by_imd_decile.py | Filters invalid “NA” population values before casting/aggregation. |
| src/nhp/data/reference/ods_trusts.py | Improves typing and XML parsing safety for organisation records. |
| src/nhp/data/reference/icb_catchments.py | Removes unnecessary type-ignore and keeps request paging logic. |
| src/nhp/data/reference/day_procedures.py | Minor refactor of lambda formatting around binomial tests. |
| src/nhp/data/raw_data/mitigators/ip/efficiency/excess_beddays.py | Switches CSV ingest to pandas→Spark DataFrame creation. |
| src/nhp/data/raw_data/mitigators/ip/activity_avoidance/smoking_related_admissions.py | Switches SAF CSV ingest to pandas→Spark DataFrame creation. |
| src/nhp/data/raw_data/mitigators/ip/activity_avoidance/obesity_related_admissions.py | Switches OAF CSV ingest to pandas→Spark DataFrame creation. |
| src/nhp/data/raw_data/mitigators/ip/activity_avoidance/evidence_based_interventions_msk.py | Formatting-only adjustment to a diagnosis predicate. |
| src/nhp/data/raw_data/mitigators/ip/activity_avoidance/alcohol.py | Replaces sparkContext JSON ingest with createDataFrame from dicts. |
| src/nhp/data/raw_data/inpatients.py | Adds maternity episode typing enrichment and uses it in spell selection. |
| src/nhp/data/raw_data/ecds.py | Normalises token/person identifiers and casts key columns to string. |
| src/nhp/data/population_projections/get_ons_files_2022.py | Improves aria-label lambda formatting and asserts href type for NPP link. |
| src/nhp/data/nhp_datasets/local_authorities.py | Removes unused/obsolete local authority successor logic. |
| src/nhp/data/model_data/generate_synthetic_data.py | Strips heavy attrs before writing parquet to reduce output metadata. |
| src/nhp/data/inputs_data/save_parquet.py | Adds Spark-based single-parquet-file writer utility for outputs. |
| src/nhp/data/inputs_data/rates.py | Orders outputs deterministically and uses new parquet saver. |
| src/nhp/data/inputs_data/procedures.py | Orders outputs deterministically and uses new parquet saver. |
| src/nhp/data/inputs_data/op/rates.py | Filters out non-positive denominators after follow-up adjustment. |
| src/nhp/data/inputs_data/op/expat_repat.py | Filters out non-positive aggregated counts before persisting. |
| src/nhp/data/inputs_data/ip/rates.py | Filters out non-positive denominators after population join. |
| src/nhp/data/inputs_data/inequalities.py | Filters out non-positive populations, fixes log message, uses parquet saver. |
| src/nhp/data/inputs_data/expat_repat.py | Adds deterministic ordering and uses parquet saver. |
| src/nhp/data/inputs_data/diagnoses.py | Orders outputs deterministically, updates docstring, uses parquet saver. |
| src/nhp/data/inputs_data/baseline.py | Orders outputs deterministically and uses parquet saver. |
| src/nhp/data/inputs_data/age_sex.py | Refactors aggregation flow and uses parquet saver for output. |
| src/nhp/data/default/opa.py | Switches default view creation to use environment-specific table names. |
| src/nhp/data/default/ecds.py | Switches default view creation to use environment-specific table names. |
| src/nhp/data/default/apc.py | Switches default view creation to use environment-specific table names. |
| pyproject.toml | Adds a console entry point for reference-icb_catchments. |
| databricks_workflows/nhp_data.yaml | Removes webhook notifications and notification settings from job definition. |
| databricks_workflows/nhp_data-reference_data.yaml | Reorders/extends reference tasks, unifies cluster key, updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-population_projections.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-outpatients.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-inputs-provider.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-inputs-lad23cd.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-inpatients.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-extract_nhp_for_containers.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks_workflows/nhp_data-ecds.yaml | Unifies cluster key and updates cluster spec for UDAL. |
| databricks.yml | Switches bundle workspace host(s) to UDAL and simplifies prod target config. |
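To make the new `table_names/` package concrete, here is a hedged sketch of the pattern the table above describes: a `TableNames` dataclass, per-environment instances, and a hard-coded selection in `__init__.py`. The field names and table strings are purely illustrative; the PR page does not show the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableNames:
    hes_apc: str
    ecds: str
    population_by_lsoa21: str


# hypothetical per-environment mappings (mlcsu.py / udal.py)
MLCSU = TableNames(
    hes_apc="mlcsu.hes.apc",
    ecds="mlcsu.ecds.main",
    population_by_lsoa21="mlcsu.reference.population_by_lsoa21",
)

UDAL = TableNames(
    hes_apc="udal.hes.apc",
    ecds="udal.ecds.main",
    population_by_lsoa21="udal.reference.population_by_lsoa21",
)

# in table_names/__init__.py: hard-coded to UDAL for now, per this PR
table_names = UDAL
```

Consumers then reference fields like `table_names.hes_apc` rather than hard-coded strings, which is what lets the default views in `src/nhp/data/default/` switch environments without code changes.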
Comments suppressed due to low confidence (1)
databricks.yml:33
- The `prod` bundle target no longer specifies `run_as`, `root_path`, or explicit `permissions`. This changes how/where production resources are deployed and who can manage them (often defaulting to the deploying user). Please confirm this is intentional for the UDAL move; otherwise restore the production identity/path/permissions settings needed for automated/protected prod deployments.
```yaml
prod:
  mode: production
  workspace:
    host: https://adb-6450443583208388.8.azuredatabricks.net
```
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
yiwen-h requested changes on Mar 18, 2026.
yiwen-h (Member) left a comment:
Went through on call 18/03/2026 - all good - just check population_by_lsoa and ping me when done!
yiwen-h approved these changes on Mar 18, 2026.