Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions .claude/commands/test-vpc-agent.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
---
description: Run the VPC-agent end-to-end test loop against the local FastAPI mock backend (Kind or direct-celery)
argument-hint: "[--kind | --code] [--scenarios path/to/file.json] [--timeout 180] [--manual]"
---

# Run the VPC-agent end-to-end test loop

You are running the test suite for the local VPC agent against the FastAPI mock backend at `mock_backend/`. The mock simulates DrDroid cloud — it issues a bearer token, records every inbound request as JSONL, and lets you seed connection-test + playbook-task scenarios.

User args (may be empty): `$ARGUMENTS`

## Step 0: pick the path

Two paths exist. Choose based on what the user is changing:

- **`--kind` (default when uncertain on a branch that touches Dockerfile, base image, requirements.txt, helm/, or deploy_local.sh):** runs the *real* agent Docker image inside Kind alongside a sibling mock pod. Use this for image migrations (e.g. Debian→Alpine), helm changes, dep upgrades that affect runtime.
- **`--code` (default for pure Python changes — connector logic, playbook tasks, business code):** runs celery directly on the host against a host-bound mock. Faster (~30s vs minutes for image build).

If the user passed `--manual`, skip both and print the three-terminal manual flow from `mock_backend/README.md`. Stop.

If neither flag is set, infer: `git diff main --name-only | grep -E '^(Dockerfile|requirements\.txt|helm/|deploy_local\.sh|mock_backend/Dockerfile)'` — any hit ⇒ default `--kind`, else default `--code`.

## Step 1: preflight

Path-specific. Run the relevant checks in parallel.

**For `--kind`:**
- `kind get clusters` — check the binary works; cluster will be created if missing.
- `kubectl version --client` — fail fast if missing.
- `docker info` — Docker Desktop must be running.
- `cat credentials/secrets.yaml | head -5` from the repo root — should be non-empty. Empty file means scenarios that reference real connectors will fail with "Connector not found". Warn but proceed.

**For `--code`:**
- `redis-cli ping` — must return `PONG`. If not, suggest `docker run -d --rm --name drd-redis -p 6379:6379 redis:alpine` and ask before running.
- `cat credentials/secrets.yaml | head -5` (same as above).
- `lsof -i :8080` — must be free. If taken, pass `MOCK_BACKEND_PORT=<free>` through.
- `python -c 'import django, celery'` from the repo root — agent Python deps importable. If not, ask the user about `pip install -r requirements.txt`.

## Step 2: run the loop

### `--kind`

```bash
cd /Users/dipeshmittal/drdroid/drd-vpc-agent
./mock_backend/deploy_kind.sh
```

Pass `--scenarios <path>` and `--timeout <n>` through. The script:
1. ensures Kind cluster `drd-local` + `drdroid` namespace
2. builds `mock_backend/Dockerfile` and `kind load`s it
3. applies `mock_backend/k8s/manifests.yaml` (Deployment + Service)
4. waits for readiness, port-forwards `:18080`
5. mints a token via `/admin/tokens`
6. shells out to `./deploy_local.sh <token>` with `DRD_CLOUD_API_HOST=http://drd-vpc-agent-mock.drdroid.svc.cluster.local:8080` so agent pods talk to the mock over cluster DNS
7. seeds scenarios + runs `e2e.py`

The user does NOT bring a token — the mock issues it. `--reset-only` tears down the agent helm release and mock manifest.

### `--code`

```bash
cd /Users/dipeshmittal/drdroid/drd-vpc-agent
./mock_backend/run_e2e.sh
```

Same shape (start mock → mint token → boot agent → seed → validate) but agent runs as celery host processes. Pass `SCENARIOS_FILE=<path>` and `E2E_TIMEOUT=<n>` through.

Both scripts stream to stderr. Let them run; don't wrap in another background subprocess.

## Step 3: report

On PASS: one or two lines.

On FAIL: open the relevant logs.

**For `--kind`:**
- `kubectl -n drdroid get pods` — check pod state.
- `kubectl -n drdroid logs deploy/drd-vpc-agent-mock --tail 100`
- `kubectl -n drdroid logs -l app=celery-worker --tail 100`
- `kubectl -n drdroid logs -l app=celery-beat --tail 100`
- `kubectl -n drdroid describe pod <crashed-pod>` if any pod is in CrashLoopBackOff.
- While `deploy_kind.sh` is running, the mock admin API is up at `:18080` — `curl -s http://127.0.0.1:18080/admin/recorded` for buckets, `curl -s http://127.0.0.1:18080/admin/recorded/<bucket>` for events.

**For `--code`:**
- `mock_backend/data/mock-backend.log`
- `mock_backend/data/agent-logs/{beat,worker,worker-exec}.log`
- `mock_backend/data/recorded/*.jsonl` — the actual wire traffic.

Common failure modes:
- Agent pod CrashLoopBackOff right after `deploy_local.sh`: usually `secrets.yaml` is empty/malformed, or the alpine image is missing a binary the agent runtime needs (this is exactly what we're testing for on `experiment/alpine-base` — surface the traceback; don't try to fix it).
- "agent did not poll X": worker isn't draining. Check broker connectivity (redis pod for kind; host redis for code path); check worker log for traceback.
- "playbook_tasks executed without errors: FAIL" with `Connector not found`: scenario's `connector_name` isn't in `credentials/secrets.yaml`. Tell the user; don't edit secrets yourself.
- 401 in mock log: token env didn't reach the agent. For kind, check `kubectl -n drdroid get secret drd-cloud-secret -o yaml | base64 -D` (token field). For code, check the env exported by `run_e2e.sh`.

Don't speculate beyond the logs. If genuinely ambiguous after reading them, ask the user.

## Notes

- The test harness lives at `mock_backend/`. Code is checked in; `mock_backend/data/` is gitignored.
- This skill exists because the agent talks to `DRD_CLOUD_API_HOST` for ping/registration/connection tests/asset metadata/playbook tasks, and exercising those without mocking means hitting production.
- If the user wants to add new scenarios, point them at `mock_backend/scenarios/default.json` (commented inline) or `mock_backend/README.md`.
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -182,3 +182,8 @@ helm/values.local.yaml
# Local copy of debug toolkit for testing
drdroid-debug-toolkit/
deploy_local.sh

# Per-user Claude Code settings (project-scoped slash commands under
# .claude/commands/ should be committed; this file holds local permission
# allowlists that vary by machine).
.claude/settings.local.json
5 changes: 5 additions & 0 deletions mock_backend/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
data/
__pycache__/
*.pyc
.venv/
venv/
13 changes: 13 additions & 0 deletions mock_backend/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-alpine

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py storage.py ./

ENV PYTHONUNBUFFERED=1
EXPOSE 8080

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080", "--log-level", "info"]
199 changes: 199 additions & 0 deletions mock_backend/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,199 @@
# mock_backend — local stand-in for DrDroid cloud

A FastAPI service that speaks the exact wire protocol the VPC agent
expects. The agent has no idea it's not the real backend.

## What it does

| Surface | Purpose |
|---|---|
| `GET /connectors/proxy/ping` | startup reachability check (agent app `ready()`) |
| `POST /connectors/proxy/ping` | periodic heartbeat with pod health (every 50s) |
| `POST /connectors/proxy/register` | connector registration |
| `POST /connectors/proxy/connector/connection/tests` | poll for connection-test requests (every 10s) |
| `POST /connectors/proxy/connector/connection/results` | result upload |
| `POST /connectors/proxy/connector/metadata/register` | asset-metadata batches from extractors |
| `POST /playbooks-engine/proxy/execution/tasks` | poll for tasks (every 1s) — incl. `ASSET_REFRESH` |
| `POST /playbooks-engine/proxy/execution/results` | task / asset_refresh result upload |

Auth is `Authorization: Bearer <token>`. Tokens are issued by the mock
itself — the agent only knows it via `DRD_CLOUD_API_TOKEN`.

Every inbound request (body + headers + query + the response we sent) is
recorded as a JSONL event under `data/recorded/<bucket>.jsonl`.

## Layout

```
mock_backend/
├── app.py # FastAPI app: proxy + admin + catch-all
├── storage.py # JSON persistence (tokens, recordings, queues)
├── seed_scenarios.py # push a scenarios file into the admin API
├── e2e.py # poll recordings + assert agent behaviour
├── scenarios/default.json # default seeded scenarios
├── run_mock.sh # foreground: mock + token
├── run_agent.sh # foreground: celery beat/workers pointed at the mock
├── run_e2e.sh # one-shot loop: mock → agent → seed → validate
└── data/ # gitignored — recordings, queues, logs
```

## Three ways to run it

### 1. Test the actual Docker image in Kind (recommended for image migrations)

This is what you want when validating things like the Debian → Alpine
base swap. The agent runs from its real image inside Kind; the mock
runs as a sibling pod in the same `drdroid` namespace.

```bash
# from repo root
./mock_backend/deploy_kind.sh
```

**Build-time note:** the alpine agent image takes ~20 min to build on
first run because several deps (psycopg2-binary, pymongo, sqlalchemy,
cryptography, lxml) have no musllinux wheels and compile from source.
By default `deploy_kind.sh` **auto-reuses** an existing
`drd-vpc-agent:local-<commit>` image — subsequent iterations are ~30s.

| When | Flag |
|---|---|
| Agent code changed → must rebuild | `--rebuild-agent` (or `--rebuild`) |
| Only mock/scenarios changed → reuse agent | (default) |
| Force-skip everything (fastest) | `--reuse-image` |
| Mock backend code changed | (auto-detected — mock rebuilds when src is newer than image) |

What it does:
1. ensures the `drd-local` Kind cluster + `drdroid` namespace
2. builds `mock_backend/Dockerfile` → `drd-vpc-agent-mock:local-latest`
3. `kind load`s it and applies `mock_backend/k8s/manifests.yaml`
4. waits for readiness, opens a `kubectl port-forward` on `:18080`
5. mints a token via `POST /admin/tokens`
6. shells out to `./deploy_local.sh <token>` with
`DRD_CLOUD_API_HOST=http://drd-vpc-agent-mock.drdroid.svc.cluster.local:8080`
so the agent pods talk to the mock over cluster DNS
7. seeds `scenarios/default.json`
8. runs `e2e.py` and reports PASS/FAIL

You don't bring a token — the mock issues one. Reset everything between
runs with `./mock_backend/deploy_kind.sh --reset-only`.

### 2. Skip Kind, run agent directly via celery (fast feedback on code)

```bash
# Prereq: redis on :6379, repo Python deps importable
./mock_backend/run_e2e.sh
```

Exit 0 = PASS, 1 = FAIL. Logs in `mock_backend/data/`.

### 3. Manual three-terminal loop

```bash
# terminal 1 — mock
./mock_backend/run_mock.sh
# (note the printed DRD_CLOUD_API_TOKEN / DRD_CLOUD_API_HOST)

# terminal 2 — agent
DRD_CLOUD_API_TOKEN=... DRD_CLOUD_API_HOST=http://127.0.0.1:8080 \
./mock_backend/run_agent.sh

# terminal 3 — push scenarios + validate
.venv/bin/python mock_backend/seed_scenarios.py --reset
.venv/bin/python mock_backend/e2e.py
```

## Admin API (test control)

| Method | Path | Purpose |
|---|---|---|
| `POST` | `/admin/tokens` | mint a token (returns `{token, label}`) |
| `GET` | `/admin/tokens` | list tokens |
| `DELETE` | `/admin/tokens/{token}` | revoke a token |
| `POST` | `/admin/queues/connection-tests` | enqueue a connection test (or `{items:[…]}` batch) |
| `POST` | `/admin/queues/playbook-tasks` | enqueue a playbook task / `ASSET_REFRESH` |
| `GET` | `/admin/queues/{name}` | peek pending |
| `DELETE` | `/admin/queues/{name}` | clear queue |
| `GET` | `/admin/recorded` | list recording buckets |
| `GET` | `/admin/recorded/{bucket}` | list events in a bucket |
| `POST` | `/admin/reset?keep_tokens=true` | wipe queues + recordings |

Recording buckets emitted by the proxy endpoints:
`ping_get`, `ping_post`, `register_connectors`, `connection_tests_poll`,
`connection_tests_results`, `metadata_register`, `playbook_tasks_poll`,
`playbook_tasks_results`, `unmatched`.

## Scenarios

Edit `scenarios/default.json`. Two arrays:

- `connection_tests`: each item is `{connector_name, request_id?}`. The
agent looks up `connector_name` in `credentials/secrets.yaml`, runs a
real connection test, and posts the result back.
- `playbook_tasks`: each item is a `playbook_task_execution` dict. Two
task shapes are exercised by default:

**Real kubectl tasks** (canonical proto3 JSON — `dict_to_proto` parses
this directly into `PlaybookTask`):
```json
{
"task": {
"source": "KUBERNETES",
"kubernetes": {
"type": "COMMAND",
"command": {"command": "kubectl get pods -n drdroid -o json"}
}
}
}
```
These run inside the agent pod via `KubectlApiProcessor` — no
connector configured in `secrets.yaml` is needed because the helm
chart sets `NATIVE_KUBERNETES_API_MODE=true` and `kubectl` is in the
agent's image. The agent's own RBAC (`drdroid-k8s-cluster-role`)
permits pods/services/events/namespaces.

**`ASSET_REFRESH`** (intercepted before proto parsing — wrapper-form
`{"value": …}` here is intentional):
```json
{
"task": {
"drd_proxy_agent": {
"type": "ASSET_REFRESH",
"asset_refresh": {
"connector_name": {"value": "native_k8s"},
"connector_type": {"value": 47},
"extractor_method": {"value": "extract_pods"}
}
}
}
}
```
`connector_type` is the protobuf `Source` enum int (KUBERNETES = 47).
Drop `extractor_method` to run every `extract_*` method.

Underscore-prefixed keys (`_doc`, `$schema_note`) are ignored — handy for
inline documentation.

### What `e2e.py` checks for kubectl tasks

For each kubectl scenario it pulls the agent's reported output from the
recordings and asserts:

1. The command's `output` is parseable JSON.
2. A stable, deployment-managed item is present (e.g. the mock-backend
pod must show up in `get pods -n drdroid`; the `drdroid` namespace
in `get namespaces`).
3. **Ground-truth match** — if `kubectl` is on the host PATH and points
at the same Kind cluster, the validator runs the same command on the
host and asserts the *set of names* matches what the agent reported.
Events are checked for "non-empty on both sides" only (events churn).

## Notes

- The mock holds queues in flat JSON files (`data/queues/*.json`); each
poll drains the file. Restart-safe.
- `data/` is gitignored. Tokens live there too — they don't survive a
hard reset (`POST /admin/reset?keep_tokens=false`).
- `scripts/start-celery-worker.sh` reads `CELERY_QUEUE`; we run a beat
+ a `celery` worker + an `exec` worker. The asset_extraction queue is
only enabled when `NATIVE_KUBERNETES_API_MODE=true`.
Loading
Loading