infra(bq): cluster all 6 cms_npd tables on most-queried column #110
Merged
…nt reclustering)

Server-side action completed today on the cms_npd dataset to permanently fix the runaway-cost root cause from Hemalatha's billing investigation:

- practitioner CLUSTER BY _npi
- organization CLUSTER BY _npi
- location CLUSTER BY _managing_org_id
- endpoint CLUSTER BY _managing_org_id
- practitioner_role CLUSTER BY _practitioner_id
- organization_affiliation CLUSTER BY _org_id

Cause: /api/provider-search was running `SELECT resource FROM cms_npd.practitioner WHERE _npi = @npi LIMIT 5` 6,249 times in May 2026. The table had no clustering → every call full-scanned the ~10 GB JSON column → ~$0.058 per call → ~$190 of the $211 May invoice.

Verified post-fix on production: same query now bills 45 MB instead of 11,580 MB (257× cheaper). /api/provider-search confirmed live + returning correct results (curl test against ainpi.dev). /api/npd/data-quality + /api/npd/validation also 200 OK.

The other 5 tables haven't shown runaway behavior YET but have the same exposure shape; preventive reclustering avoids the next incident on whichever route happens to add an unclustered hot-path query.

Added a clustering reference table to CLAUDE.md's BigQuery section plus a warning that any new production route filtering by a non-clustered column needs either reclustering or an explicit maximum_bytes_billed cap.

Total one-time recluster cost: <$1 (5 tables × ~$0.10 each). Future cost reduction at current traffic: ~$190 → ~$1.40/month on provider-search alone; comparable savings if any route ever joins or filters the other 5 tables by their cluster key.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds clustering to all 6 tables in the `cms_npd` dataset on each table's most-queried `_*` column. Without clustering, BigQuery full-scans the table on every `WHERE _x = @y` query — for the `practitioner` table that's ~10 GB billed per call. Clustering reduces the same query to <100 MB billed.
Server-side action completed today (DDL: `CREATE TABLE … CLUSTER BY …` → `DROP TABLE` → `ALTER TABLE … RENAME TO …` for each).
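For reference, the three-step swap described above can be sketched as a small statement generator. This is an illustration rather than the script that was actually run: the cluster keys come from the commit message, while the `_clustered` staging suffix and the `recluster_statements` helper are assumptions introduced here.

```python
# Cluster keys as listed in this PR's commit message.
CLUSTER_KEYS = {
    "practitioner": "_npi",
    "organization": "_npi",
    "location": "_managing_org_id",
    "endpoint": "_managing_org_id",
    "practitioner_role": "_practitioner_id",
    "organization_affiliation": "_org_id",
}

DATASET = "cms_npd"
TMP_SUFFIX = "_clustered"  # hypothetical staging-table suffix, not from the PR


def recluster_statements(table: str, cluster_key: str) -> list[str]:
    """Return the three DDL statements that swap in a clustered copy of `table`."""
    src = f"{DATASET}.{table}"
    tmp = f"{DATASET}.{table}{TMP_SUFFIX}"
    return [
        # 1. Build a clustered copy next to the original.
        f"CREATE TABLE {tmp} CLUSTER BY {cluster_key} AS SELECT * FROM {src}",
        # 2. Drop the unclustered original.
        f"DROP TABLE {src}",
        # 3. Give the clustered copy the original name.
        f"ALTER TABLE {tmp} RENAME TO {table}",
    ]


if __name__ == "__main__":
    for table, key in CLUSTER_KEYS.items():
        for stmt in recluster_statements(table, key):
            print(stmt + ";")
```

Note that between the `DROP TABLE` and the `RENAME TO` the original table name briefly does not exist, so a swap like this is probably best run in a low-traffic window.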
Verified post-fix on production: `SELECT resource WHERE _npi = X` now bills 45 MB instead of 11,580 MB — ~257× cheaper for the realistic full-record query shape.
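The ratio above, and the roughly $1.40/month projection quoted in the commit message, can be reproduced from the reported figures. The sketch below assumes that on-demand cost scales linearly with bytes billed; it reuses the ~$0.058 per-call figure from the commit message and does not assume any particular price per TiB.

```python
# Figures reported in this PR.
MB_BEFORE = 11_580            # MB billed per call before clustering
MB_AFTER = 45                 # MB billed per call after clustering
COST_PER_CALL_BEFORE = 0.058  # USD per call before clustering, as reported
CALLS_IN_MAY = 6_249          # provider-search calls in May 2026

# How many times fewer bytes the same query now bills.
reduction = MB_BEFORE / MB_AFTER

# Assumption: cost is proportional to bytes billed.
cost_per_call_after = COST_PER_CALL_BEFORE * MB_AFTER / MB_BEFORE
monthly_after = cost_per_call_after * CALLS_IN_MAY

print(f"reduction: ~{reduction:.0f}x")
print(f"per call after: ~${cost_per_call_after:.6f}")
print(f"monthly at May traffic: ~${monthly_after:.2f}")
```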
Verified production routes still 200 OK after the swap: /api/provider-search, /api/npd/data-quality, and /api/npd/validation.
What's in this PR
Docs only: adds a clustering-keys reference table to CLAUDE.md's BigQuery section, plus a warning so future contributors don't add a hot-path route that filters by a non-cluster-key column without first reclustering or setting a per-query `maximum_bytes_billed` cap (cap helpers are defined per the GCP cost-control checklist).
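As an illustration of that rule (not the repo's actual cap helpers, which are not shown in this PR), a route could consult the cluster-key table and attach a byte cap whenever it filters on some other column. The helper name `job_config_kwargs`, the 100 MiB default, and the `_name` column in the demo are all hypothetical. `maximum_bytes_billed` is the property name used by the Python `google-cloud-bigquery` client's `QueryJobConfig`; other clients spell it differently.

```python
# Cluster keys as listed in this PR's commit message.
CLUSTER_KEYS = {
    "practitioner": "_npi",
    "organization": "_npi",
    "location": "_managing_org_id",
    "endpoint": "_managing_org_id",
    "practitioner_role": "_practitioner_id",
    "organization_affiliation": "_org_id",
}

DEFAULT_CAP_BYTES = 100 * 1024**2  # hypothetical 100 MiB cap, not from the PR


def job_config_kwargs(table: str, filter_column: str,
                      cap_bytes: int = DEFAULT_CAP_BYTES) -> dict:
    """Return QueryJobConfig kwargs: no cap when filtering on the table's
    cluster key, an explicit maximum_bytes_billed cap otherwise."""
    if CLUSTER_KEYS.get(table) == filter_column:
        return {}
    return {"maximum_bytes_billed": cap_bytes}


if __name__ == "__main__":
    # Filtering on the cluster key: no cap added.
    print(job_config_kwargs("practitioner", "_npi"))
    # Filtering on another (hypothetical) column: cap added.
    print(job_config_kwargs("practitioner", "_name"))
    # With the Python client installed, the result would be passed as:
    #   bigquery.QueryJobConfig(**job_config_kwargs("practitioner", "_name"))
```

With a cap in place, BigQuery rejects a query that would bill more than the limit instead of running it, so an unclustered hot-path query fails loudly rather than accumulating cost.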
Cost impact