Production: OpenShell-backed GMA and extraction workloads

## Summary

Move Graph Management Assistant (GMA) sticky sessions and batch extraction jobs from **Docker-out-of-Docker** (`docker.sock` on the API, `agentic_ci` + `--network host`) to **NVIDIA OpenShell** sandboxes with enforced network policy, provider-managed Vertex credentials, and auditable lifecycle events.

Dev is validating **Track B** (`KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell`, `job_runner=openshell`) against a host-local gateway. This issue tracks production rollout.

## Current state

| Layer | Dev (today) | Prod (today) |
|-------|-------------|--------------|
| GMA sticky runtime | OpenShell or container (compose toggle) | `container` (configmap default) |
| Extraction jobs | OpenShell or `agentic_ci` | `agentic_ci` + host network |
| Security Phase 0 | Runtime auth on `/v1/turn`, no env workload token, Docker hardening | Partial config only |
| OpenShell policies | Bundled YAML (`soft` enforcement) | ConfigMap at `/etc/openshell/policies` (`hard_requirement`) |
| Gateway | systemd user service on developer host | **Not deployed** |

Code supports both backends via `ExtractionWorkloadRuntimeSettings` and factory wiring (`container` / `openshell` / `memory`).
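
The backend selection described above could be wired roughly as in the sketch below. This is illustrative only: `RuntimeSettings`, the three runtime classes, and `build_runtime` are hypothetical stand-ins, not the actual `ExtractionWorkloadRuntimeSettings` or factory code in the repo. Only the backend names (`container`, `openshell`, `memory`) come from this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSettings:
    """Hypothetical stand-in for ExtractionWorkloadRuntimeSettings."""

    # "container" mirrors the prod configmap default listed in the table.
    backend: str = "container"


class ContainerRuntime:
    name = "container"


class OpenShellRuntime:
    name = "openshell"


class MemoryRuntime:
    name = "memory"


# Backend name -> runtime class. The three keys come from the issue text.
_BACKENDS = {
    cls.name: cls for cls in (ContainerRuntime, OpenShellRuntime, MemoryRuntime)
}


def build_runtime(settings: RuntimeSettings):
    """Return a runtime for the configured backend, failing loudly on typos."""
    try:
        runtime_cls = _BACKENDS[settings.backend]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(
            f"unknown backend {settings.backend!r}; expected one of: {known}"
        ) from None
    return runtime_cls()
```

Rejecting unknown values at startup, rather than silently falling back to `container`, would keep a misspelled prod overlay from quietly re-enabling the Docker-out-of-Docker path.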

## Target architecture (prod)

```
User → Kartograph API → openshell CLI → OpenShell Gateway → sandbox (sticky or job)
                              ↑                    ↑
                         mTLS + policy      Vertex provider (google-vertex-ai)
```

- **No `docker.sock` on the API pod** — API invokes OpenShell against a cluster-managed gateway endpoint.
- **Network egress** allowlisted per workload mode (Kartograph workload API + `inference.local` / Vertex only).
- **Credentials** via OpenShell providers (`kartograph-gma`, job provider), not gcloud ADC volume mounts in sandboxes.
- **Sticky runtime auth** unchanged: session-bound token on `/v1/turn`; workload JWT per turn in request body.

## Workstreams

### 1. OpenShell gateway (cluster infra)

- [ ] Deploy `openshell-gateway` as a dedicated Deployment/Service (or approved platform pattern), not per-node systemd.
- [ ] Register gateway with Kartograph API via `KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_GATEWAY_URL` (in-cluster DNS, mTLS).
- [ ] Write a runbook covering gateway upgrades, certificate rotation, and health checks.
- [ ] Decide compute driver: **Docker** vs **Kubernetes** sandbox driver for prod (prefer K8s-native if OpenShell supports it in the target cluster).

### 2. Kartograph API runtime configuration

- [ ] Set prod overlay:
  - `KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell`
  - `KARTOGRAPH_EXTRACTION_RUNTIME_JOB_RUNNER=openshell`
  - `KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_ENFORCEMENT=hard_requirement`
  - `KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_DIR=/etc/openshell/policies`
- [ ] Mount `kartograph-openshell-policies` ConfigMap (already in `deploy/apps/kartograph/base/`).
- [ ] Package the `openshell` CLI in the API image **or** a sidecar. The API currently shells out to `openshell`, so prod must not depend on the host PATH.
- [ ] Remove prod dependency on `docker.sock`, `gcloud` config mounts, and `agentic_ci` host networking once OpenShell is proven.
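
A startup or CI check along the following lines could catch a prod overlay that drifts from the values listed above. It is a sketch under assumptions: `prod_overlay_problems` is a hypothetical helper, and treating an unset gateway URL as an error is a design choice made here, not something the current code is known to do. The variable names and expected values are taken from this issue.

```python
from typing import List, Mapping

PREFIX = "KARTOGRAPH_EXTRACTION_RUNTIME_"

# Expected prod values, copied from the overlay checklist in this issue.
EXPECTED_PROD = {
    PREFIX + "BACKEND": "openshell",
    PREFIX + "JOB_RUNNER": "openshell",
    PREFIX + "OPENSHELL_POLICY_ENFORCEMENT": "hard_requirement",
    PREFIX + "OPENSHELL_POLICY_DIR": "/etc/openshell/policies",
}

# Must be set, but its value is cluster-specific, so only presence is checked.
REQUIRED_NON_EMPTY = (PREFIX + "OPENSHELL_GATEWAY_URL",)


def prod_overlay_problems(env: Mapping[str, str]) -> List[str]:
    """Return one human-readable message per setting that deviates from prod."""
    problems = []
    for key, expected in EXPECTED_PROD.items():
        actual = env.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    for key in REQUIRED_NON_EMPTY:
        if not env.get(key, "").strip():
            problems.append(f"{key}: must be set")
    return problems
```

Calling `prod_overlay_problems(os.environ)` at API startup and refusing to boot on a non-empty result would be one way to use it; whether prod should fail closed like this is a decision for the rollout.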

### 3. Policy maturity (Phase 3)

Per-mode policies (already authored under `deploy/openshell/policies/`):

- [ ] `gma-initial-schema-design.yaml`
- [ ] `gma-extraction-jobs.yaml`
- [ ] `gma-one-off-mutations.yaml`
- [ ] `extraction-job.yaml`

Each should enforce:

- [ ] L4 allowlist: `kartograph-api:8000` and inference endpoints only, plus job-specific dependencies (e.g. GitHub/PyPI) for extraction jobs.
- [ ] L7 path restrictions on workload API where OpenShell supports it (`/extraction/workloads/*`).
- [ ] `hard_requirement` Landlock in prod overlays.
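
To make the intended L4 and L7 semantics concrete for review, the decision each policy should encode can be modeled as below. This is a plain Python model for reasoning and unit tests, not the OpenShell policy schema. The port `443` for `inference.local` is an assumption, and `egress_allowed` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Optional

# L4: (host, port) pairs a sandbox may reach.
# kartograph-api:8000 is from this issue; port 443 for inference.local is assumed.
L4_ALLOWLIST = frozenset({
    ("kartograph-api", 8000),
    ("inference.local", 443),
})

# L7: path patterns per destination. Destinations without an entry are L4-only.
L7_PATHS = {
    ("kartograph-api", 8000): ("/extraction/workloads/*",),
}


def egress_allowed(host: str, port: int, path: Optional[str] = None) -> bool:
    """Model the allow/deny decision for one outbound request."""
    dest = (host.lower(), port)
    if dest not in L4_ALLOWLIST:
        return False  # default deny at L4
    patterns = L7_PATHS.get(dest)
    if patterns is None or path is None:
        return True  # no L7 rule applies, so the L4 decision stands
    # Note: fnmatch's "*" also matches "/", so nested paths are allowed.
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

For example, `egress_allowed("kartograph-api", 8000, "/extraction/workloads/abc")` is allowed, while the same host with `/admin`, or any request to `example.com:443`, is denied.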

### 4. Providers & inference

- [ ] Create prod OpenShell provider: `--type google-vertex-ai` (not legacy `google-cloud`).
- [ ] Service account or workload-identity-backed credentials; **no** developer ADC in prod.
- [ ] Configure gateway inference routing if required by agent harness.

### 5. Observability & audit

- [ ] Wire `LoggingOpenShellRuntimeProbe` (or successor) to structured domain logs / OCSF-aligned events for policy apply and sandbox lifecycle.
- [ ] Alert on sandbox start failures, policy compile failures, gateway disconnects.
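
One possible shape for the structured events is sketched below. It is not the `LoggingOpenShellRuntimeProbe` implementation, and the field names are illustrative rather than an actual OCSF mapping. `lifecycle_event`, `emit`, and the `openshell.audit` logger name are all hypothetical.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("openshell.audit")  # hypothetical logger name


def lifecycle_event(kind: str, sandbox_id: str, outcome: str, **detail) -> dict:
    """Build one audit record for a policy-apply or sandbox lifecycle step."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "kind": kind,              # e.g. "policy.apply", "sandbox.start"
        "sandbox_id": sandbox_id,
        "outcome": outcome,        # "success" or "failure"
        "detail": detail,          # free-form context; never put tokens here
    }


def emit(event: dict) -> str:
    """Log the event as a single JSON line and return that line."""
    line = json.dumps(event, sort_keys=True)
    # Failures go out at ERROR so alert rules can key off the log level.
    level = logging.ERROR if event["outcome"] == "failure" else logging.INFO
    logger.log(level, line)
    return line
```

Emitting one JSON object per line keeps the events easy to ship to a log pipeline, and routing failures to `ERROR` gives the alerts on sandbox start and policy compile failures something simple to match on.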

### 6. Defense in depth

- [ ] Keep/adapt `networkpolicy-sticky-runtime.yaml` for any remaining plain-container workloads during transition.
- [ ] Retire container backend from prod config after soak period.

### 7. Deprecate Docker-out-of-Docker path

After prod soak:

- [ ] Default prod to OpenShell only; keep `container`/`agentic_ci` for local dev fallback.
- [ ] Remove `AgenticCiExtractionJobRunner` host-network path from prod code paths (optional code cleanup issue).

## Acceptance criteria

1. A GMA sticky session starts, passes its health check, and completes chat turns in prod with the OpenShell backend.
2. An extraction job runs end-to-end in an OpenShell sandbox and reaches `kartograph-api:8000` without host networking.
3. Sandbox egress to non-allowlisted destinations is denied (negative test).
4. No long-lived workload JWT is present in the container/sandbox env; runtime auth is required on `/v1/turn`.
5. The runbook documents gateway failure modes and a temporary rollback to the container backend.
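
The negative test in criterion 3 could be as small as the probe below, run from inside a sandbox. It is a sketch: `can_connect` is a hypothetical helper, the `connect` parameter exists only so the probe can be unit-tested without a network, and a plain TCP probe assumes the policy blocks at connect time (a proxy-based enforcement path might need an HTTP-level check instead).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0,
                connect=socket.create_connection) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
    conn.close()
    return True
```

Inside a sandbox, the check would then be that `can_connect("kartograph-api", 8000)` is true while a non-allowlisted destination such as `can_connect("example.com", 443)` is false. Because the probe cannot tell a policy denial from an unreachable host, pairing it with the positive case guards against a false pass.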

## References

- Code: `src/api/extraction/infrastructure/openshell/`
- Dev compose: `compose.dev.yaml` (Track A / Track B toggle)
- Prod stubs: `deploy/apps/kartograph/base/openshell-policies-configmap.yaml`, `networkpolicy-sticky-runtime.yaml`, `configmap.yaml`
- OpenShell docs: https://docs.nvidia.com/openshell/about/installation

## Out of scope (separate issues)

- hp-fleet-gitops canonical deploy sync (the `deploy/` directory in this repo holds reference templates).
- Agent-runtime image hardening beyond current Phase 0 flags.
