infra(bq): cluster all 6 cms_npd tables on most-queried column by aks129 · Pull Request #110 · FHIR-IQ/AINPI

aks129 · 2026-05-27T02:38:49Z

Adds clustering to all 6 tables in the `cms_npd` dataset on each table's most-queried `_*` column. Without clustering, BigQuery full-scans the table on every `WHERE _x = @y` query — for the `practitioner` table that's ~10 GB billed per call. Clustering reduces the same query to <100 MB billed.

| Table | Cluster key | Reasoning |
| --- | --- | --- |
| `practitioner` | `_npi` | NPI-based lookups (e.g. `/api/provider-search`) |
| `organization` | `_npi` | Same pattern: NPI lookups |
| `location` | `_managing_org_id` | Drill-down from org |
| `endpoint` | `_managing_org_id` | Drill-down from org |
| `practitioner_role` | `_practitioner_id` | Drill-down from practitioner |
| `organization_affiliation` | `_org_id` | Drill-down from org |

Server-side action completed today (DDL: `CREATE TABLE … CLUSTER BY …` → `DROP TABLE` → `ALTER TABLE … RENAME TO …` for each).
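For reference, the three-step swap described above could be generated along these lines. This is a minimal sketch, not the exact statements that were run: the `_clustered` staging-table suffix and the `AS SELECT *` copy are assumptions, and a plain `CREATE TABLE … AS SELECT` does not carry over table options such as descriptions or partitioning, so those would need to be restated if the tables use them.

```python
# Cluster keys as listed in the PR description's table.
CLUSTER_KEYS = {
    "practitioner": "_npi",
    "organization": "_npi",
    "location": "_managing_org_id",
    "endpoint": "_managing_org_id",
    "practitioner_role": "_practitioner_id",
    "organization_affiliation": "_org_id",
}


def recluster_statements(table: str, key: str, dataset: str = "cms_npd") -> list[str]:
    """Build the create / drop / rename DDL for one table.

    The staging name (`<table>_clustered`) is hypothetical; any unused
    table name in the same dataset would do.
    """
    staging = f"{table}_clustered"
    return [
        # 1. Copy the data into a new table that is clustered on `key`.
        f"CREATE TABLE `{dataset}.{staging}` CLUSTER BY {key} "
        f"AS SELECT * FROM `{dataset}.{table}`",
        # 2. Drop the original, unclustered table.
        f"DROP TABLE `{dataset}.{table}`",
        # 3. Give the clustered copy the original name.
        f"ALTER TABLE `{dataset}.{staging}` RENAME TO `{table}`",
    ]


if __name__ == "__main__":
    for table, key in CLUSTER_KEYS.items():
        for stmt in recluster_statements(table, key):
            print(stmt + ";")
```

Note that between steps 2 and 3 the original table name does not exist, so queries against it fail during that window; running the swap at a quiet time keeps that exposure small.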

Verified post-fix on production: `SELECT resource WHERE _npi = X` now bills 45 MB instead of 11,580 MB — ~257× cheaper for the realistic full-record query shape.

Verified production routes still 200 OK after the swap:

- `/api/provider-search` returns the expected FHIR record correctly
- `/api/npd/data-quality` 200 in 1.4s
- `/api/npd/validation` 200 in 5.6s

What's in this PR

Docs only: adds a clustering-keys reference table to the BigQuery section of CLAUDE.md, plus a warning so future contributors don't accidentally add a hot-path route that filters by a non-cluster-key column without first reclustering the table or setting a per-query `maximum_bytes_billed` cap (cap helpers are defined per the GCP cost-control checklist).
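As an illustration of the per-query cap, here is a minimal sketch using the `google-cloud-bigquery` Python client. It is not the repo's own cap helper (those are not shown in this PR); the function names, the 100 MiB default, and the `STRING` type for `_npi` are all assumptions. BigQuery rejects a query whose estimated bytes billed exceed `maximum_bytes_billed`, and the rejected query incurs no charge.

```python
# Hypothetical default cap: 100 MiB. Pick a value that fits the route.
DEFAULT_CAP_BYTES = 100 * 1024 * 1024

PRACTITIONER_SQL = (
    "SELECT resource FROM cms_npd.practitioner WHERE _npi = @npi LIMIT 5"
)


def validated_cap(max_bytes: int = DEFAULT_CAP_BYTES) -> int:
    """Return the cap unchanged, rejecting values that would disable it."""
    if max_bytes <= 0:
        raise ValueError("maximum_bytes_billed must be a positive byte count")
    return max_bytes


def fetch_practitioner(npi: str, max_bytes: int = DEFAULT_CAP_BYTES) -> list[dict]:
    """Run the NPI lookup with a hard ceiling on bytes billed.

    Requires the google-cloud-bigquery package and application default
    credentials; the import is local so the module loads without them.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        # The job fails instead of running if it would bill more than this.
        maximum_bytes_billed=validated_cap(max_bytes),
        query_parameters=[
            # Assumes `_npi` is a STRING column; adjust if it is numeric.
            bigquery.ScalarQueryParameter("npi", "STRING", npi),
        ],
    )
    rows = client.query(PRACTITIONER_SQL, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

With a cap in place, a route that accidentally filters on an unclustered column fails fast with an error rather than silently billing a full scan on every call.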

Cost impact

- One-time recluster: <$1 total
- Per-call cost reduction on `/api/provider-search`: ~257× cheaper for full-record reads, ~1,100× for thin reads
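The full-record figures can be reproduced with a few lines of arithmetic. The sketch below uses the measured bytes billed (11,580 MB before, 45 MB after) and the May 2026 call count (6,249) from the commit message; the rate of $5 per decimal TB is an assumption, chosen only because it reproduces the ~$0.058 per-call figure quoted there, and should be checked against the project's actual on-demand pricing.

```python
MB_BEFORE = 11_580        # bytes billed per call before clustering (measured)
MB_AFTER = 45             # bytes billed per call after clustering (measured)
CALLS_PER_MONTH = 6_249   # /api/provider-search calls in May 2026

# ASSUMPTION: on-demand rate in USD per decimal terabyte scanned.
USD_PER_TB = 5.0
MB_PER_TB = 1_000_000


def usd_for(mb_billed: float, usd_per_tb: float = USD_PER_TB) -> float:
    """Convert megabytes billed into dollars at the assumed rate."""
    return mb_billed / MB_PER_TB * usd_per_tb


reduction = MB_BEFORE / MB_AFTER
per_call_before = usd_for(MB_BEFORE)
per_call_after = usd_for(MB_AFTER)
monthly_after = per_call_after * CALLS_PER_MONTH

print(f"reduction factor: {reduction:.1f}x")
print(f"per call before:  ${per_call_before:.4f}")
print(f"per call after:   ${per_call_after:.6f}")
print(f"monthly after:    ${monthly_after:.2f}")
```

Under that assumed rate the sketch gives a reduction of about 257×, roughly $0.058 per call before clustering, and about $1.41 per month after, in line with the ~$1.40/month projection in the commit message.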


…nt reclustering)

Server-side action completed today on the cms_npd dataset to permanently fix the runaway-cost root cause from Hemalatha's billing investigation:

- practitioner CLUSTER BY _npi
- organization CLUSTER BY _npi
- location CLUSTER BY _managing_org_id
- endpoint CLUSTER BY _managing_org_id
- practitioner_role CLUSTER BY _practitioner_id
- organization_affiliation CLUSTER BY _org_id

Cause: /api/provider-search was running `SELECT resource FROM cms_npd.practitioner WHERE _npi = @npi LIMIT 5` 6,249 times in May 2026. The table had no clustering → every call full-scanned the ~10 GB JSON column → ~$0.058 per call → ~$190 of the $211 May invoice.

Verified post-fix on production: same query now bills 45 MB instead of 11,580 MB (257× cheaper). /api/provider-search confirmed live + returning correct results (curl test against ainpi.dev). /api/npd/data-quality + /api/npd/validation also 200 OK.

The other 5 tables haven't shown runaway behavior YET but have the same exposure shape; preventive reclustering avoids the next incident on whichever route happens to add an unclustered hot-path query.

Added a clustering reference table to CLAUDE.md's BigQuery section plus a warning that any new production route filtering by a non-clustered column needs either reclustering or an explicit maximum_bytes_billed cap.

Total one-time recluster cost: <$1 (5 tables × ~$0.10 each). Future cost reduction at current traffic: ~$190 → ~$1.40/month on provider-search alone; comparable savings if any route ever joins or filters the other 5 tables by their cluster key.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vercel · 2026-05-27T02:38:54Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
ainpi	Building	Preview, Comment	May 27, 2026 2:38am

aks129 merged commit 016c77e into main May 27, 2026
6 of 7 checks passed

aks129 deleted the cluster-all-bq-tables branch May 27, 2026 02:38

vercel Bot deployed to Preview May 27, 2026 02:39 View deployment

aks129 changed the title ~~docs(bq): document clustering keys after 2026-05-27 cost-incident reclustering~~ infra(bq): cluster all 6 cms_npd tables on most-queried column May 27, 2026
