infra: GCP cost-control architecture (budget alert + kill-switch Cloud Function + BQ per-query cap) #108

Merged: aks129 merged 1 commit into main from cost-control-kill-switch-and-bq-cap on May 27, 2026

@aks129 aks129 commented May 27, 2026

Adds the project's standing cost-control infrastructure for GCP. This is the canonical setup any future paid-API usage on the project must work within.

What's in this PR

1. infrastructure/kill-billing-function/ — Cloud Function source code

  • Listens to billing-alerts Pub/Sub topic
  • Calls cloudbilling.projects.updateBillingInfo({billingAccountName: ""}) when costAmount >= budgetAmount
  • Net effect: when monthly spend reaches the configured threshold, billing is disabled on the project within ~1 minute and all paid services stop
  • README has the deploy command + IAM grant + test/recovery procedures
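
The kill-switch behaviour described above can be sketched as follows. This is a minimal illustration, not the contents of the PR's main.py: the function names, the `PROJECT_ID` placeholder, and the `dry_run` flag are assumptions made for the example. The only details taken from the description are the `costAmount >= budgetAmount` comparison and the `updateBillingInfo` call with an empty `billingAccountName`.

```python
import base64
import json

PROJECT_ID = "my-project-id"  # hypothetical placeholder, not the real project


def parse_budget_notification(event_data: str) -> dict:
    """Decode the base64 Pub/Sub payload into the budget notification dict."""
    return json.loads(base64.b64decode(event_data).decode("utf-8"))


def should_disable_billing(notification: dict) -> bool:
    """True once spend has reached the budget (costAmount >= budgetAmount)."""
    return notification["costAmount"] >= notification["budgetAmount"]


def disable_billing(project_id: str) -> dict:
    """Detach the billing account from the project, which stops paid services."""
    # Imported lazily so the pure functions above run without the client library.
    from googleapiclient import discovery

    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    request = billing.projects().updateBillingInfo(
        name=f"projects/{project_id}",
        body={"billingAccountName": ""},  # an empty name disables billing
    )
    return request.execute()


def handle(event_data: str, project_id: str = PROJECT_ID, dry_run: bool = True) -> str:
    """Decide what to do for one budget notification; dry_run skips the API call."""
    notification = parse_budget_notification(event_data)
    if not should_disable_billing(notification):
        return "under budget; no action"
    if dry_run:
        return "budget reached; dry run"
    disable_billing(project_id)
    return "budget reached; billing disabled"
```

Keeping the comparison in a pure function makes the threshold logic checkable locally, while the billing call itself stays behind an explicit opt-in.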

2. BigQuery per-query cap (100 GB ≈ $0.50 per query)

  • frontend/src/lib/bigquery.ts: DEFAULT_MAX_BYTES_BILLED = 100_000_000_000; queryBigQuery() applies it by default
  • analysis/claims_sources/_cohorts.py: bq_job_config() helper returning a QueryJobConfig with the same cap
  • Current production BQ queries each scan well under 25 GB, so the 100 GB cap leaves more than 4× headroom
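
The cap and its arithmetic can be sketched in Python as below. The constant value and the `bq_job_config()` name come from the description above; the `$5 per TB` rate is an assumption inferred from "100 GB ≈ $0.50" and should be checked against current on-demand pricing. `max_cost_usd` and `headroom` are hypothetical helpers added for illustration, and the body of `bq_job_config()` is a guess at the shape of the real helper in _cohorts.py.

```python
DEFAULT_MAX_BYTES_BILLED = 100_000_000_000  # 100 GB, same value as bigquery.ts
ASSUMED_USD_PER_TB = 5.00  # assumption implied by "100 GB ≈ $0.50"; verify pricing


def max_cost_usd(max_bytes: int = DEFAULT_MAX_BYTES_BILLED,
                 usd_per_tb: float = ASSUMED_USD_PER_TB) -> float:
    """Upper bound on the on-demand cost of one query under the byte cap."""
    return max_bytes / 1e12 * usd_per_tb


def headroom(largest_scan_bytes: int,
             max_bytes: int = DEFAULT_MAX_BYTES_BILLED) -> float:
    """How many times larger the cap is than the largest observed scan."""
    return max_bytes / largest_scan_bytes


def bq_job_config(max_bytes: int = DEFAULT_MAX_BYTES_BILLED):
    """Return a QueryJobConfig that caps bytes billed for a single query."""
    # Imported lazily so the arithmetic helpers work without the client library.
    from google.cloud import bigquery

    return bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
```

A query would then be run as `client.query(sql, job_config=bq_job_config())`. When a query would bill more than the cap, BigQuery rejects it instead of running it, so the failure is free.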

3. CLAUDE.md updated with the cost-control summary + architecture review checklist so future contributors apply the same discipline.

Server-side prerequisites (completed outside this PR)

Actions (all done):

  • Project-level budget alert with thresholds at 50/90/100%
  • Budget linked to billing-alerts Pub/Sub topic
  • APIs enabled: cloudfunctions, cloudbuild, pubsub, cloudbilling, serviceusage, run, eventarc
  • Service account kill-billing-sa@thematic-fort-453901-t7.iam.gserviceaccount.com created
  • Disabled Maps/Places APIs at project level (not needed for AINPI)

Manual user steps to finish kill-switch deploy

The agent's classifier blocked it from deploying the function: the function holds billing-disable IAM, so a human should review the source before deploy. The two remaining one-shot commands are documented in infrastructure/kill-billing-function/README.md. They follow a least-privilege pattern (projectManager on the project + user on the billing account), so the SA can only disable billing for THIS project.

After deploy, smoke-test with a payload below the threshold (won't disable anything):

gcloud pubsub topics publish billing-alerts \
  --message='{"costAmount":0.50,"budgetAmount":10.00,"alertThresholdExceeded":0.5,"currencyCode":"USD"}'
gcloud functions logs read disable-billing-on-budget --gen2 --region=us-central1 --limit=10

Expected log lines: "event: cost=$0.50 budget=$10.00" and "under budget; no action".
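
The same smoke test can be rehearsed locally before publishing to the real topic. The sketch below wraps the message the way Pub/Sub delivers it (base64-encoded data) and derives the log lines quoted above. The helper names are hypothetical and the log format is copied from the expected output in this description, so the deployed function's actual wording may differ.

```python
import base64
import json

# Same payload as the gcloud pubsub publish command above.
SMOKE_MESSAGE = (
    '{"costAmount":0.50,"budgetAmount":10.00,'
    '"alertThresholdExceeded":0.5,"currencyCode":"USD"}'
)


def wrap_as_pubsub_data(message: str) -> str:
    """Base64-encode a message the way Pub/Sub hands it to a subscriber."""
    return base64.b64encode(message.encode("utf-8")).decode("ascii")


def expected_log_lines(data: str) -> list:
    """Build the log lines a below-threshold notification should produce."""
    n = json.loads(base64.b64decode(data))
    lines = [f"event: cost=${n['costAmount']:.2f} budget=${n['budgetAmount']:.2f}"]
    if n["costAmount"] < n["budgetAmount"]:
        lines.append("under budget; no action")
    return lines


if __name__ == "__main__":
    for line in expected_log_lines(wrap_as_pubsub_data(SMOKE_MESSAGE)):
        print(line)
```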

Test plan

  • tsc --noEmit clean on bigquery.ts change
  • Production health verified post-changes: /, /findings, /articles/... all 200; /api/npd/validation (BQ-backed) returns expected JSON
  • After deploy: smoke-test Pub/Sub publish + log read (commands above)
  • Recovery path validated: gcloud billing projects link re-enables billing (also in README)

…er-query cap

Triggered by a $200 GCP bill from a Maps Platform API key (uid
0b7b11c7) created 2025-03-16 — unrelated to AINPI (zero references
in the codebase). Already-executed mitigation:

- DELETED the Maps Platform API key
- DISABLED 10 Maps/Places APIs at project level (places, maps-backend,
  geocoding-backend, etc.) — even cached copies of the key now error 403
- Created $10/mo budget alert at thresholds 50/90/100% on billing
  account 01B58B-9C267D-ECC805
- Linked budget to Pub/Sub topic
  projects/thematic-fort-453901-t7/topics/billing-alerts

This commit adds the code-side controls:

1. Cloud Function billing kill-switch (`infrastructure/kill-billing-function/`):
   - main.py listens to billing-alerts Pub/Sub topic, disables billing
     when costAmount >= budgetAmount
   - README has the deploy command + the one-time IAM grant the agent
     classifier blocked (billing.projectManager on the billing account)
   - Service account already created: kill-billing-sa@...
   - Pub/Sub topic + service account already exist; only deploy + IAM
     grant remain (commands in the README)

2. BigQuery per-query maximum-bytes-billed cap:
   - frontend/src/lib/bigquery.ts: DEFAULT_MAX_BYTES_BILLED = 100 GB
     (~$0.50 per query). queryBigQuery() applies it by default; opt-out
     by passing { maximumBytesBilled: <larger> }
   - analysis/claims_sources/_cohorts.py: bq_job_config() helper that
     returns a QueryJobConfig with the same cap. Future Python scripts
     should pass job_config=bq_job_config() on every BQ query

3. CLAUDE.md updated with the cost-control summary + pointers to the
   new helpers so future contributors don't accidentally bypass them.

Manual user steps required to complete kill-switch (commands in
infrastructure/kill-billing-function/README.md):
  a. gcloud billing accounts add-iam-policy-binding ... role=billing.projectManager
  b. cd infrastructure/kill-billing-function && gcloud functions deploy ...

Production verified healthy after Maps disable: /, /findings,
/articles/eight-years-post-exclusion all 200; /api/npd/validation (BQ-
backed) returns expected JSON.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

  • Project: ainpi
  • Deployment: Building
  • Actions: Preview, Comment
  • Updated (UTC): May 27, 2026 1:45am

@aks129 aks129 merged commit 2024e12 into main May 27, 2026
6 of 7 checks passed
@aks129 aks129 deleted the cost-control-kill-switch-and-bq-cap branch May 27, 2026 01:45
@aks129 aks129 changed the title from "cost-control: $10/mo budget alert + kill-switch infra + BQ per-query cap" to "infra: GCP cost-control architecture (budget alert + kill-switch Cloud Function + BQ per-query cap)" May 27, 2026