
infra(bq): cluster all 6 cms_npd tables on most-queried column #110

Merged: aks129 merged 1 commit into main from cluster-all-bq-tables on May 27, 2026

@aks129 commented May 27, 2026

Adds clustering to all 6 tables in the `cms_npd` dataset on each table's most-queried `_*` column. Without clustering, BigQuery full-scans the table on every `WHERE _x = @y` query — for the `practitioner` table that's ~10 GB billed per call. Clustering reduces the same query to <100 MB billed.

| Table | Cluster key | Reasoning |
|---|---|---|
| `practitioner` | `_npi` | NPI-based lookups (e.g. `/api/provider-search`) |
| `organization` | `_npi` | Same pattern: NPI lookups |
| `location` | `_managing_org_id` | Drill-down from org |
| `endpoint` | `_managing_org_id` | Drill-down from org |
| `practitioner_role` | `_practitioner_id` | Drill-down from practitioner |
| `organization_affiliation` | `_org_id` | Drill-down from org |
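
As a rough illustration of how the table above could be used when reviewing a new route, here is a minimal Python sketch. The `CLUSTER_KEYS` mapping is copied from the table; the helper name `filters_on_cluster_key` is hypothetical and is not something defined in this repo.

```python
# Cluster keys copied from the table above.
CLUSTER_KEYS = {
    "practitioner": "_npi",
    "organization": "_npi",
    "location": "_managing_org_id",
    "endpoint": "_managing_org_id",
    "practitioner_role": "_practitioner_id",
    "organization_affiliation": "_org_id",
}


def filters_on_cluster_key(table: str, filter_column: str) -> bool:
    """Return True when a WHERE filter on `filter_column` matches the table's cluster key.

    Hypothetical review helper: unknown tables return False so they get a closer look.
    """
    return CLUSTER_KEYS.get(table) == filter_column
```

For example, `filters_on_cluster_key("location", "_npi")` is `False`, which would flag that query shape for either reclustering or a bytes-billed cap.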

Server-side action completed today (DDL: `CREATE TABLE … CLUSTER BY …` → `DROP TABLE` → `ALTER TABLE … RENAME TO …` for each).
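
The three-step swap described above might look roughly like the following when generated per table. This is a sketch under assumptions, not the exact DDL that was run: the `AS SELECT *` copy and the `<table>_clustered` temporary name are guesses, and `recluster_statements` is a hypothetical helper.

```python
def recluster_statements(dataset: str, table: str, cluster_key: str) -> list[str]:
    """Build the create / drop / rename statements for one table.

    Assumptions: the clustered copy is built with AS SELECT *, and the
    temporary table is named <table>_clustered (both illustrative only).
    """
    tmp = f"{table}_clustered"  # hypothetical temporary table name
    return [
        # 1. Create a clustered copy of the table.
        f"CREATE TABLE `{dataset}.{tmp}` CLUSTER BY {cluster_key} "
        f"AS SELECT * FROM `{dataset}.{table}`",
        # 2. Drop the original, unclustered table.
        f"DROP TABLE `{dataset}.{table}`",
        # 3. Rename the clustered copy to the original name.
        f"ALTER TABLE `{dataset}.{tmp}` RENAME TO `{table}`",
    ]


for statement in recluster_statements("cms_npd", "practitioner", "_npi"):
    print(statement + ";")
```

Between the `DROP TABLE` and the `RENAME TO`, the original table name does not resolve, so any query arriving in that window would fail.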

Verified post-fix on production: `SELECT resource FROM cms_npd.practitioner WHERE _npi = @npi` now bills 45 MB instead of 11,580 MB, ~257× cheaper for the realistic full-record query shape.

Verified production routes still 200 OK after the swap:

  • `/api/provider-search` returns the expected FHIR record
  • `/api/npd/data-quality` 200 in 1.4s
  • `/api/npd/validation` 200 in 5.6s

What's in this PR

Docs only. Adds a clustering-keys reference table to the BigQuery section of CLAUDE.md, plus a warning so future contributors don't add a hot-path route that filters by a non-cluster-key column without first reclustering or setting a per-query `maximum_bytes_billed` cap (cap helpers are defined per the GCP cost-control checklist).
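
To illustrate what a per-query cap can look like, here is a minimal Python sketch that builds a request body for the BigQuery REST `jobs.query` method with `maximumBytesBilled` set. It is not the repo's cap helper: `capped_query_request` is a hypothetical name, and treating `_npi` as a `STRING` parameter is an assumption.

```python
def capped_query_request(sql: str, npi: str, max_bytes_billed: int) -> dict:
    """Build a jobs.query request body with a bytes-billed cap.

    Hypothetical helper; assumes the @npi parameter is a STRING.
    """
    return {
        "query": sql,
        "useLegacySql": False,
        "parameterMode": "NAMED",
        "queryParameters": [
            {
                "name": "npi",
                "parameterType": {"type": "STRING"},
                "parameterValue": {"value": npi},
            }
        ],
        # int64 fields are sent as strings in the REST API's JSON.
        "maximumBytesBilled": str(max_bytes_billed),
    }


body = capped_query_request(
    "SELECT resource FROM cms_npd.practitioner WHERE _npi = @npi LIMIT 5",
    npi="1234567890",            # placeholder value, not a real lookup
    max_bytes_billed=100_000_000,  # 100 MB, an illustrative cap
)
```

With a cap in place, a query that would bill more than the limit fails instead of running, so an unclustered filter shows up as an error rather than on the invoice.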

Cost impact

  • One-time recluster: <$1 total
  • Per-call cost reduction on `/api/provider-search`: ~257× cheaper for full-record reads, ~1,100× for thin reads
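
The full-record figures can be checked with a few lines of arithmetic. In this sketch the $5 per TB rate and the decimal MB-to-TB conversion are assumptions, chosen because they reproduce the ~$0.058 per-call and ~$1.40/month figures reported for this incident; they are not taken from a price list. The ~1,100× thin-read figure depends on bytes billed for that query shape, which is not stated here, so it is not reproduced.

```python
USD_PER_TB = 5.0        # assumed on-demand rate; check the project's actual pricing
MB_PER_TB = 1_000_000   # assumes the MB figures are decimal megabytes


def cost_usd(mb_billed: float, usd_per_tb: float = USD_PER_TB) -> float:
    """Approximate on-demand cost of a query that bills `mb_billed` megabytes."""
    return mb_billed / MB_PER_TB * usd_per_tb


before = cost_usd(11_580)        # per call, unclustered
after = cost_usd(45)             # per call, clustered on _npi
ratio = 11_580 / 45              # bytes-billed reduction
monthly_after = 6_249 * after    # at the May 2026 call count

print(f"before: ${before:.4f}  after: ${after:.6f}  ratio: {ratio:.0f}x")
print(f"monthly after: ${monthly_after:.2f}")
```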

…nt reclustering)

Server-side action completed today on the cms_npd dataset to permanently
fix the runaway-cost root cause from Hemalatha's billing investigation:

- practitioner            CLUSTER BY _npi
- organization            CLUSTER BY _npi
- location                CLUSTER BY _managing_org_id
- endpoint                CLUSTER BY _managing_org_id
- practitioner_role       CLUSTER BY _practitioner_id
- organization_affiliation CLUSTER BY _org_id

Cause: /api/provider-search was running `SELECT resource FROM
cms_npd.practitioner WHERE _npi = @npi LIMIT 5` 6,249 times in May 2026.
The table had no clustering → every call full-scanned the ~10 GB JSON
column → ~$0.058 per call → ~$190 of the $211 May invoice.

Verified post-fix on production: same query now bills 45 MB instead
of 11,580 MB (257× cheaper). /api/provider-search confirmed live +
returning correct results (curl test against ainpi.dev).
/api/npd/data-quality + /api/npd/validation also 200 OK.

The other 5 tables haven't shown runaway behavior YET but have the
same exposure shape — preventive reclustering avoids the next incident
on whichever route happens to add an unclustered hot-path query.

Added a clustering reference table to CLAUDE.md's BigQuery section
plus a warning that any new production route filtering by a
non-clustered column needs either reclustering or an explicit
maximum_bytes_billed cap.

Total one-time recluster cost: <$1 (5 tables × ~$0.10 each). Future
cost reduction at current traffic: ~$190 → ~$1.40/month on
provider-search alone; comparable savings if any route ever joins or
filters the other 5 tables by their cluster key.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@aks129 aks129 merged commit 016c77e into main May 27, 2026
6 of 7 checks passed
@aks129 aks129 deleted the cluster-all-bq-tables branch May 27, 2026 02:38
@aks129 aks129 changed the title from "docs(bq): document clustering keys after 2026-05-27 cost-incident reclustering" to "infra(bq): cluster all 6 cms_npd tables on most-queried column" May 27, 2026