runwhen-contrib · stewartshea · May 26, 2026 · May 27, 2026 · May 28, 2026 · May 28, 2026
@@ -0,0 +1,71 @@
+---
+description: Contract for RunWhen Local discovery indexers (native SDK + CloudQuery). Enforces registry parity, collector registration, normalizer shape, and docs/test/generation-rule definition of done.
+globs: src/indexers/**,src/enrichers/azure.py,src/enrichers/gcp.py,src/enrichers/aws.py,scripts/azure/**,scripts/gcp/**
+alwaysApply: false
+---
+
+# Discovery indexer contract
+
+RunWhen Local discovers resources via native SDK indexers (`azureapi`,
+`gcpapi`, future `awsapi`) plus the legacy `cloudquery` indexer. The native
+indexers exist to remove the CloudQuery dependency entirely. When editing any
+indexer, enforce every rule below.
+
+For full workflows use the `extend-discovery-type` and `add-discovery-type`
+skills. Worked examples: `docs/architecture/{azure,gcp}-indexer-internals.md`.
+
+## CloudQuery parity is the public contract
+
+- The **CloudQuery table name** (e.g. `gcp_compute_instances`) is the
+  `resource_type` used in generation rules. Native indexers MUST normalize
+  payloads into the same CloudQuery-shaped dict so rules don't change when the
+  backend flips.
+- Every CloudQuery table for a platform MUST resolve via the registry by its
+  canonical name or a registered alias. Don't drop parity to simplify.
+
+## Generated registry — never hand-edit
+
+`src/indexers/*_resource_type_registry.yaml` is GENERATED. To change a mapping:
+
+1. Edit the overrides YAML (`scripts/<platform>/<platform>_resource_type_overrides.yaml`).
+2. Re-run `scripts/<platform>/sync_<platform>_resource_type_registry.py`.
+3. Commit both the overrides and the regenerated YAML; the sync `--dry-run`
+   must report 0 drift.
+
+Use `null` for the native type of tables with no API equivalent so the generic
+pass skips them.
+
+## Typed vs generic collectors
+
+- The **generic collector** (Azure `resources.list()`, GCP Cloud Asset
+  Inventory, AWS Cloud Control) provides broad parity.
+- **Typed collectors** are enrichment for high-value types. Register them in
+  `_TYPED_COLLECTORS` keyed by the canonical CloudQuery table name. Types with a
+  typed collector are excluded from the generic pass, so each resource is written once.
+- Lazy-import SDK clients inside the collector so importing the module never
+  hard-requires the SDK.
+
+## Discovery is selective + anchored
+
+- Collect only types referenced by loaded generation rules, plus the mandatory
+  anchor (GCP `gcp_projects`, Azure `azure_resources_resource_groups`).
+- Emit the anchor first so child resources link to it at parse time.
+- Respect Level-of-Detail: scopes resolving to `none` are skipped entirely.
+
+## Normalizer shape
+
+Every resource flows: `normalize -> tag filter -> handler.parse_resource_data
+-> writer.add_resource`. Normalizers MUST map cloud labels/tags into the
+canonical `tags` field and preserve the qualifier fields the tag/hierarchy
+templates expect. Persist only via the `ResourceWriter`, never directly.
+
+## Definition of done (every indexer change)
+
+A change is incomplete until ALL of these are updated together:
+
+- [ ] Code + registry (regenerated, not hand-edited)
+- [ ] Unit tests: registry loader, normalizer+handler round-trip, selective dispatch
+- [ ] Docs: `docs/authoring/indexed-resources/<platform>.md` and the indexer-internals doc
+- [ ] Generation-rule contract verified: the type resolves by CloudQuery table
+      name and the normalized dict carries the fields the templates need
+- [ ] `cd src && python -m unittest discover -s indexers -p 'test_*.py'` passes
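One way to cover the "normalizer+handler round-trip" item is a test asserting that labels land in `tags` and that the fields the templates read survive normalization. The sketch below is hypothetical: `normalize_instance` and the `name` / `project_id` field names are assumptions for illustration, not the repository's normalizer API.

```python
import unittest


def normalize_instance(raw: dict, project_id: str) -> dict:
    """Hypothetical normalizer: raw API payload -> CloudQuery-shaped dict."""
    normalized = dict(raw)                              # hoist the full payload
    normalized["tags"] = dict(raw.get("labels") or {})  # labels -> canonical `tags`
    normalized["project_id"] = project_id               # assumed qualifier field
    return normalized


class NormalizerShapeTest(unittest.TestCase):
    def test_labels_become_tags(self):
        raw = {"name": "vm-1", "labels": {"env": "prod"}}
        out = normalize_instance(raw, "demo-project")
        self.assertEqual(out["tags"], {"env": "prod"})
        self.assertEqual(out["name"], "vm-1")
        self.assertEqual(out["project_id"], "demo-project")

    def test_missing_labels_yield_empty_tags(self):
        out = normalize_instance({"name": "vm-2"}, "demo-project")
        self.assertEqual(out["tags"], {})
```

A file named `test_*.py` under `src/indexers` containing a case like this would be picked up by the `unittest discover` command in the checklist.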
@@ -0,0 +1,38 @@
+---
+description: Guardrail for generated resource-type registry YAMLs and their sync scripts. Edit overrides + re-run sync; never hand-edit the generated file.
+globs: src/indexers/*_resource_type_registry.yaml,scripts/azure/*_overrides.yaml,scripts/gcp/*_overrides.yaml,scripts/azure/sync_*.py,scripts/gcp/sync_*.py,scripts/azure/*_cloudquery_tables.txt,scripts/gcp/*_cloudquery_tables.txt
+alwaysApply: false
+---
+
+# Resource-type registry is generated
+
+`src/indexers/*_resource_type_registry.yaml` is produced by
+`scripts/<platform>/sync_<platform>_resource_type_registry.py` from three
+inputs. **Do not hand-edit the generated YAML.**
+
+## To change a mapping
+
+1. Edit the right input:
+   - **Native-type wrong / missing** -> `*_overrides.yaml`
+     (`arm_type_overrides` for Azure, `cai_type_overrides` for GCP; use `null`
+     when there is no native API equivalent).
+   - **Alternate name** -> `aliases` in the overrides.
+   - **Rich payload wanted** -> add the table to `typed_collectors`.
+   - **Service -> API host remap** -> `service_api_hosts` (GCP).
+   - **New parity tables** -> refresh the `*_cloudquery_tables.txt` snapshot
+     from the CloudQuery hub (record source + version in its header).
+2. Regenerate:
+
+```bash
+python scripts/gcp/sync_gcp_resource_type_registry.py      # or azure/...
+```
+
+3. Review the `added / removed / changed` summary and the YAML diff; commit the
+   overrides AND the regenerated YAML together.
+
+## Parity invariant
+
+Every CloudQuery table for the platform must resolve via the registry by
+canonical name or alias. The sync `--dry-run` must report 0 drift after your
+change. Removing a table from parity requires an explicit reason, not
+convenience.
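The parity invariant can be pictured as a small set check. This sketch assumes a simplified registry shape (canonical table name mapped to an entry with an optional `aliases` list); the generated YAML's real layout may differ, and `unresolved_tables`, the `gcp_instances` alias, and the sample tables are illustrative only.

```python
from typing import Dict, Iterable, List


def unresolved_tables(tables: Iterable[str], registry: Dict[str, dict]) -> List[str]:
    """Return CloudQuery tables that resolve by neither canonical name nor alias."""
    known = set(registry)
    for entry in registry.values():
        known.update(entry.get("aliases") or [])
    return sorted(t for t in tables if t not in known)


# Assumed, simplified registry content for illustration.
registry = {
    "gcp_compute_instances": {
        "native_type": "compute.googleapis.com/Instance",
        "aliases": ["gcp_instances"],  # hypothetical alias
    },
    "gcp_projects": {"native_type": "cloudresourcemanager.googleapis.com/Project"},
}
tables = ["gcp_compute_instances", "gcp_instances", "gcp_projects", "gcp_storage_buckets"]

print(unresolved_tables(tables, registry))
# -> ['gcp_storage_buckets']
```

An empty result corresponds to the parity invariant holding for the table list. An entry whose native type is `null` still counts as resolved, since only the name lookup matters here.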
@@ -0,0 +1,158 @@
+---
+name: add-discovery-type
+description: >-
+  Stand up a brand-new RunWhen Local discovery platform / indexer from scratch
+  (a new cloud or resource source) following the native SDK pattern used by
+  azureapi and gcpapi. Use when asked to add a new platform, new cloud provider,
+  or new top-level discovery source. Covers the parity registry, indexer
+  modules, platform handler, tag/hierarchy templates, pipeline wiring, tests,
+  and docs.
+---
+
+# Add a new discovery type
+
+Use this to add a **new platform** (a new cloud provider or discovery source).
+To add a single resource type to an existing indexer, use the
+`extend-discovery-type` skill instead.
+
+`gcpapi` is the most recent end-to-end example — read
+`docs/architecture/gcp-indexer-internals.md` first; it documents every piece you
+are about to replicate. `azureapi` is the second reference. The mandate is to
+build native SDK indexers so CloudQuery can eventually be removed entirely.
+
+## Architecture (what you are building)
+
+```
+scripts/<platform>/
+├── <platform>_cloudquery_tables.txt        # parity source (CloudQuery hub table list)
+├── <platform>_resource_type_overrides.yaml # hand-curated overrides
+└── sync_<platform>_resource_type_registry.py  # registry generator
+
+src/indexers/
+├── <platform>_resource_type_registry.yaml  # GENERATED catalog (never hand-edit)
+├── <platform>_resource_type_registry.py     # read-only loader
+├── <platform>_common.py                      # credentials + scope resolution + tag filters
+├── <platform>api_normalizers.py              # raw SDK/API -> CloudQuery-shaped dict
+├── <platform>api_resource_types.py           # generic collector + typed collectors + specs
+├── <platform>api.py                          # orchestration loop
+└── test_<platform>*.py                       # unit tests
+
+src/enrichers/<platform>.py                   # PlatformHandler.parse_resource_data
+src/templates/<platform>-tags.yaml            # tag template
+src/templates/<platform>-hierarchy.yaml       # hierarchy template
+docs/architecture/<platform>-indexer-internals.md
+docs/authoring/indexed-resources/<platform>.md
+```
+
+## Core principles
+
+1. **Parity first.** Seed the registry from the upstream CloudQuery plugin's
+   table list (use the CloudQuery hub, not the old in-container plugin version).
+   Every CloudQuery table must resolve via the registry by canonical name/alias.
+2. **CloudQuery table name is the public contract.** Generation rules reference
+   it as `resource_type`. The native indexer normalizes payloads into the same
+   shape so rules don't change when the backend flips.
+3. **Generic collector is the workhorse**; typed collectors are enrichment. Pick
+   the broadest first-party API for the generic pass (Azure
+   `resources.list()`, GCP Cloud Asset Inventory, AWS Cloud Control API).
+4. **Selective, generation-rule-driven discovery.** Only collect types
+   referenced by loaded generation rules (plus the mandatory anchor), and
+   respect Level-of-Detail scoping.
+5. **Never hand-edit the generated registry YAML.**
+
+## Workflow
+
+```
+- [ ] 1. Obtain upstream CloudQuery table list -> scripts/<p>/<p>_cloudquery_tables.txt
+- [ ] 2. Author the overrides YAML (type remaps, aliases, typed flags, anchor, nulls)
+- [ ] 3. Write the sync script (heuristic table-name -> native-type mapping)
+- [ ] 4. Generate the registry YAML + write the read-only loader
+- [ ] 5. Add SDK deps to pyproject + regenerate lock in python:3.14-slim
+- [ ] 6. Write <p>_common.py (credentials, scope/LOD, tag filters)
+- [ ] 7. Write <p>api_normalizers.py (raw -> CloudQuery-shaped dict)
+- [ ] 8. Write <p>api_resource_types.py (generic collector + typed + specs)
+- [ ] 9. Write <p>api.py orchestrator (phases: bootstrap -> anchors -> typed -> generic)
+- [ ] 10. Add/confirm the enricher PlatformHandler.parse_resource_data
+- [ ] 11. Add src/templates/<p>-tags.yaml + <p>-hierarchy.yaml
+- [ ] 12. Wire pipeline: component.py, run.sh, run.py setting, cloudquery skip
+- [ ] 13. Tests: registry + normalizers + selective
+- [ ] 14. Docs: indexer-internals + indexed-resources + READMEs
+- [ ] 15. Run the full indexer test suite
+```
+
+### 1–4. Registry pipeline
+
+Mirror `scripts/gcp/`:
+
+- **Table list**: scrape the CloudQuery hub plugin table reference into a
+  `.txt` with a header noting source + version. This is the parity source.
+- **Overrides**: a YAML with native-type remaps, `aliases`, `typed_collectors`,
+  the **mandatory anchor** type (e.g. GCP `gcp_projects`, Azure
+  `azure_resources_resource_groups`), and `null` native types for tables with
+  no API equivalent.
+- **Sync script**: heuristic `<plugin>_<service>_<entity_plural>` ->
+  `<host>/<EntitySingularPascal>`, plus override application. Copy
+  `sync_gcp_resource_type_registry.py` and adapt the heuristic.
+- **Loader** (`<platform>_resource_type_registry.py`): copy the GCP loader;
+  provide `load_registry`, `find`, and a `find_by_<native>_type` lookup.
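The table-name heuristic plus override application might look roughly like the following. This is a sketch under assumptions, not the contents of `sync_gcp_resource_type_registry.py`: the function names, the naive singularization, the default `<service>.googleapis.com` host, and the `gcp_example_no_api_table` name are all hypothetical.

```python
from typing import Dict, Optional


def singularize(word: str) -> str:
    """Naive singularization; irregular plurals would need overrides."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def guess_native_type(
    table: str,
    overrides: Dict[str, Optional[str]],
    service_api_hosts: Dict[str, str],
) -> Optional[str]:
    """Map `<plugin>_<service>_<entity_plural>` to `<host>/<EntitySingularPascal>`."""
    if table in overrides:
        return overrides[table]  # may be None: no native API equivalent
    _plugin, service, *entity = table.split("_")
    if not entity:
        return None  # too short for the heuristic; needs an explicit override
    parts = entity[:-1] + [singularize(entity[-1])]
    pascal = "".join(p.capitalize() for p in parts)
    host = service_api_hosts.get(service, f"{service}.googleapis.com")
    return f"{host}/{pascal}"


overrides = {"gcp_example_no_api_table": None}  # hypothetical table name
print(guess_native_type("gcp_compute_instances", overrides, {}))
# -> compute.googleapis.com/Instance
print(guess_native_type("gcp_compute_backend_services", overrides, {}))
# -> compute.googleapis.com/BackendService
print(guess_native_type("gcp_example_no_api_table", overrides, {}))
# -> None
```

Checking overrides before the heuristic keeps hand-curated decisions authoritative, so the heuristic only needs to be right for the common case.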
+
+### 6–9. Indexer modules
+
+- **`<platform>_common.py`**: resolve credentials (K8s secret, inline key, ADC /
+  env) and the account/project/subscription scope; provide `has_included_tags` /
+  `has_excluded_tags` label filters.
+- **`<platform>api_normalizers.py`**: convert raw SDK/API objects to the
+  CloudQuery-shaped dict the handler expects; hoist the full payload, map cloud
+  labels/tags to `tags`, normalize names/IDs/regions. Include a
+  `make_<anchor>_resource_data` helper for the synthesized anchor.
+- **`<platform>api_resource_types.py`**: a `*ResourceTypeSpec` dataclass, the
+  generic collector over the broad API, typed collectors (lazy-import SDK
+  clients), `_TYPED_COLLECTORS` keyed by canonical table name, and
+  `find_spec` / `find_spec_by_<native>_type`.
+- **`<platform>api.py`** `index(ctx)` phases:
+  1. Bootstrap: check `<platform>IndexerBackend`, resolve creds + scope, mirror
+     into the enricher, resolve LOD (skip scopes with LOD `none`).
+  2. Phase 0: emit the synthesized **anchor** resource first so children link to
+     it at parse time.
+  3. Phase 1: typed collectors for accessed types that have one.
+  4. Phase 2: one generic pass per scope, filtered to accessed native types not
+     already covered by a typed collector.
+  Every resource flows: normalize -> tag filter -> `handler.parse_resource_data`
+  -> `writer.add_resource`.
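The phase ordering above can be sketched with plain callables. Everything here is a simplified stand-in: `run_phases`, `emit`, the `skip` label filter, and the fake collectors are hypothetical, and the bootstrap step (credentials, scope, LOD) is omitted.

```python
from typing import Callable, Dict, Iterable, List, Set

Resource = dict


def run_phases(
    scope: str,
    accessed_types: Set[str],
    anchor: Resource,
    typed_collectors: Dict[str, Callable[[str], Iterable[Resource]]],
    generic_collect: Callable[[str, Set[str]], Iterable[Resource]],
    emit: Callable[[Resource], None],
) -> None:
    # Phase 0: anchor first, so children can link to it at parse time.
    emit(anchor)
    # Phase 1: typed collectors for accessed types that have one.
    typed_hits = accessed_types & set(typed_collectors)
    for table in sorted(typed_hits):
        for raw in typed_collectors[table](scope):
            emit(raw)
    # Phase 2: one generic pass over the accessed types not already covered.
    remaining = accessed_types - typed_hits
    if remaining:
        for raw in generic_collect(scope, remaining):
            emit(raw)


written: List[str] = []


def emit(raw: Resource) -> None:
    resource = {**raw, "tags": raw.get("labels") or {}}  # normalize
    if resource["tags"].get("skip") == "true":           # tag filter (illustrative)
        return
    written.append(resource["name"])  # stand-in for parse_resource_data + add_resource


run_phases(
    scope="demo-scope",
    accessed_types={"typed_table", "generic_table"},
    anchor={"name": "anchor"},
    typed_collectors={
        "typed_table": lambda scope: [
            {"name": "typed-1"},
            {"name": "typed-2", "labels": {"skip": "true"}},
        ]
    },
    generic_collect=lambda scope, types: [{"name": f"{t}-1"} for t in sorted(types)],
    emit=emit,
)
print(written)
# -> ['anchor', 'typed-1', 'generic_table-1']
```

Routing every phase through one `emit` callable keeps the normalize, filter, parse, and write steps identical for anchor, typed, and generic resources.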
+
+### 10–11. Handler + templates
+
+- Confirm/extend `src/enrichers/<platform>.py` `parse_resource_data` consumes
+  the normalized dict and links children to the anchor.
+- Add `src/templates/<platform>-tags.yaml` and `-hierarchy.yaml` following the
+  **`template-tags-hierarchy` rule** exactly (hierarchy ends with
+  `resource_name`; emit `platform`, `resource_type`, `resource_name`,
+  `child_resource`; specificity-ordered `resource_name`; dedup loop).
+
+### 12. Pipeline wiring
+
+- `src/component.py`: add `"<platform>api"` to the `INDEXER` stage.
+- `src/run.sh`: include `<platform>api` in `COMPONENTS`.
+- `src/run.py`: coalesce `<platform>IndexerBackend` from `workspaceInfo` or
+  `WB_<PLATFORM>_INDEXER_BACKEND` into request data.
+- `src/indexers/cloudquery.py`: skip this platform when the native backend is
+  selected (mutual exclusion).
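The backend coalescing in `src/run.py` could be sketched as below. The precedence (workspace value before environment variable), the `resolve_indexer_backend` name, and the `example-backend` value are assumptions; only the `<platform>IndexerBackend` key and the `WB_<PLATFORM>_INDEXER_BACKEND` variable pattern come from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_indexer_backend(
    platform: str,
    workspace_info: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Coalesce the backend setting; assumed precedence: workspaceInfo, then env."""
    value = workspace_info.get(f"{platform}IndexerBackend")
    if value:
        return str(value)
    return env.get(f"WB_{platform.upper()}_INDEXER_BACKEND") or None


# "example-backend" is a placeholder value, not a documented setting.
print(resolve_indexer_backend("gcp", {"gcpIndexerBackend": "example-backend"}, {}))
# -> example-backend
print(resolve_indexer_backend("gcp", {}, {"WB_GCP_INDEXER_BACKEND": "example-backend"}))
# -> example-backend
print(resolve_indexer_backend("gcp", {}, {}))
# -> None
```

Passing `env` as a parameter keeps the function testable without mutating the process environment.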
+
+### 13–14. Tests + docs (required)
+
+- Tests mirroring `test_gcp*.py`: registry loader contract, normalizer +
+  handler round-trip, selective discovery + generic-pass filter dispatch.
+- `docs/architecture/<platform>-indexer-internals.md` (link it from
+  `docs/architecture/README.md`).
+- `docs/authoring/indexed-resources/<platform>.md` documenting both backends,
+  the matchable types, and the stable generation-rule contract.
+- Update top-level `README.md` / docs indexes if the platform is newly listed.
+
+### 15. Verify
+
+```bash
+(cd src && python -m unittest discover -s indexers -p 'test_*.py')
+python scripts/<platform>/sync_<platform>_resource_type_registry.py --dry-run
+```
+
+The sync dry-run must report 0 drift (registry matches overrides + table list).