Production: OpenShell-backed GMA and extraction workloads #774

@aredenba-rh

Summary

Move Graph Management Assistant (GMA) sticky sessions and batch extraction jobs from Docker-out-of-Docker (docker.sock on the API, agentic_ci + --network host) to NVIDIA OpenShell sandboxes with enforced network policy, provider-managed Vertex credentials, and auditable lifecycle events.

Dev is validating Track B (KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell, job_runner=openshell) against a host-local gateway. This issue tracks production rollout.

Current state

| Layer | Dev (today) | Prod (today) |
| --- | --- | --- |
| GMA sticky runtime | OpenShell or container (compose toggle) | container (configmap default) |
| Extraction jobs | OpenShell or agentic_ci | agentic_ci + host network |
| Security Phase 0 | Runtime auth on /v1/turn, no env workload token, Docker hardening | Partial config only |
| OpenShell policies | Bundled YAML (soft enforcement) | ConfigMap at /etc/openshell/policies (hard_requirement) |
| Gateway | systemd user service on developer host | Not deployed |

Code supports both backends via ExtractionWorkloadRuntimeSettings and factory wiring (container / openshell / memory).
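
As a rough illustration of that wiring, the sketch below reads the backend and job runner from the environment and picks a runtime. Only the class name ExtractionWorkloadRuntimeSettings, the env variable names, and the three backend values come from this issue; the fields, defaults, placeholder runtime classes, and `build_runtime` are hypothetical and do not mirror the real code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

ENV_PREFIX = "KARTOGRAPH_EXTRACTION_RUNTIME_"


@dataclass(frozen=True)
class ExtractionWorkloadRuntimeSettings:
    # Hypothetical fields and defaults; the real settings class may differ.
    backend: str = "container"
    job_runner: str = "agentic_ci"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ):
        return cls(
            backend=env.get(ENV_PREFIX + "BACKEND", "container"),
            job_runner=env.get(ENV_PREFIX + "JOB_RUNNER", "agentic_ci"),
        )


# Placeholder runtimes standing in for the real implementations.
class ContainerRuntime: ...
class OpenShellRuntime: ...
class MemoryRuntime: ...


RUNTIME_FACTORIES = {
    "container": ContainerRuntime,
    "openshell": OpenShellRuntime,
    "memory": MemoryRuntime,
}


def build_runtime(settings: ExtractionWorkloadRuntimeSettings):
    """Instantiate the runtime for the configured backend, or fail loudly."""
    try:
        factory = RUNTIME_FACTORIES[settings.backend]
    except KeyError:
        raise ValueError(f"unknown backend {settings.backend!r}") from None
    return factory()
```

Failing on an unknown backend, rather than falling back to container, keeps a typo in a prod overlay from silently re-enabling the Docker path.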

Target architecture (prod)

User → Kartograph API → openshell CLI → OpenShell Gateway → sandbox (sticky or job)
                              ↑                    ↑
                         mTLS + policy      Vertex provider (google-vertex-ai)
  • No docker.sock on the API pod — API invokes OpenShell against a cluster-managed gateway endpoint.
  • Network egress allowlisted per workload mode (Kartograph workload API + inference.local / Vertex only).
  • Credentials via OpenShell providers (kartograph-gma, job provider), not gcloud ADC volume mounts in sandboxes.
  • Sticky runtime auth unchanged: session-bound token on /v1/turn; workload JWT per turn in request body.

Workstreams

1. OpenShell gateway (cluster infra)

  • Deploy openshell-gateway as a dedicated Deployment/Service (or approved platform pattern), not per-node systemd.
  • Register gateway with Kartograph API via KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_GATEWAY_URL (in-cluster DNS, mTLS).
  • Write a runbook covering gateway upgrades, cert rotation, and health checks.
  • Decide compute driver: Docker vs Kubernetes sandbox driver for prod (prefer K8s-native if OpenShell supports it in target cluster).

2. Kartograph API runtime configuration

  • Set prod overlay:
    • KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell
    • KARTOGRAPH_EXTRACTION_RUNTIME_JOB_RUNNER=openshell
    • KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_ENFORCEMENT=hard_requirement
    • KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_DIR=/etc/openshell/policies
  • Mount kartograph-openshell-policies ConfigMap (already in deploy/apps/kartograph/base/).
  • Package/install openshell CLI in API image or sidecar — API currently shells out to openshell; prod must not depend on host PATH.
  • Remove prod dependency on docker.sock, gcloud config mounts, and agentic_ci host networking once OpenShell is proven.
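
A startup preflight could catch a misconfigured prod overlay before any workload is scheduled. The sketch below is one possible shape, not existing code: the env names and expected values are taken from the list above, while `preflight_problems`, the injected `which` / `path_exists` helpers, and the `/var/run/docker.sock` path (the conventional Docker socket location) are assumptions.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional

# Expected prod values, copied from the overlay settings in this issue.
REQUIRED_PROD_ENV = {
    "KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND": "openshell",
    "KARTOGRAPH_EXTRACTION_RUNTIME_JOB_RUNNER": "openshell",
    "KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_ENFORCEMENT": "hard_requirement",
    "KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_DIR": "/etc/openshell/policies",
}

DOCKER_SOCK = "/var/run/docker.sock"  # assumed mount path


def preflight_problems(
    env: Mapping[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
    path_exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for name, expected in REQUIRED_PROD_ENV.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    # The API shells out to the CLI, so it must resolve inside the image.
    if which("openshell") is None:
        problems.append("openshell CLI not found on PATH")
    # Prod should no longer expose the Docker socket to the API pod.
    if path_exists(DOCKER_SOCK):
        problems.append(f"{DOCKER_SOCK} is still present")
    return problems
```

Calling `preflight_problems(os.environ)` at API startup and refusing to serve when the list is non-empty would be one way to use it; the lookups are injectable so the check can be unit tested without a real image.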

3. Policy maturity (Phase 3)

Per-mode policies (already authored under deploy/openshell/policies/):

  • gma-initial-schema-design.yaml
  • gma-extraction-jobs.yaml
  • gma-one-off-mutations.yaml
  • extraction-job.yaml

Each should enforce:

  • L4 allowlist: kartograph-api:8000 and inference endpoints only, plus job-specific dependencies (e.g. GitHub, PyPI) for extraction jobs.
  • L7 path restrictions on workload API where OpenShell supports it (/extraction/workloads/*).
  • hard_requirement Landlock in prod overlays.
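
The intended allow/deny decision can be modelled in a few lines, which is handy as a table of expectations for negative tests. This is not OpenShell policy syntax and does not reflect how the gateway evaluates rules; the port for inference.local, the glob matching, and the `is_allowed` helper are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# L4: (host, port) pairs a sandbox may reach. The inference port is assumed.
L4_ALLOWLIST = {("kartograph-api", 8000), ("inference.local", 443)}

# L7: optional path globs per endpoint; endpoints without an entry are
# unrestricted at L7 once they pass the L4 check.
L7_PATHS = {("kartograph-api", 8000): ["/extraction/workloads/*"]}


def is_allowed(url: str) -> bool:
    """Model the intended decision for a single outbound request."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    endpoint = (parts.hostname, port)
    if endpoint not in L4_ALLOWLIST:
        return False
    patterns = L7_PATHS.get(endpoint)
    if patterns is None:
        return True
    return any(fnmatchcase(parts.path, pattern) for pattern in patterns)
```

For example, `is_allowed("http://kartograph-api:8000/extraction/workloads/123")` is allowed, while a request to `/admin` on the same host or to any host outside the allowlist is denied.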

4. Providers & inference

  • Create prod OpenShell provider: --type google-vertex-ai (not legacy google-cloud).
  • Service account or workload-identity-backed credentials; no developer ADC in prod.
  • Configure gateway inference routing if required by agent harness.

5. Observability & audit

  • Wire LoggingOpenShellRuntimeProbe (or successor) to structured domain logs / OCSF-aligned events for policy apply and sandbox lifecycle.
  • Alert on sandbox start failures, policy compile failures, gateway disconnects.
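
A minimal sketch of what a structured lifecycle event might look like follows. The field names, logger name, and helper functions are hypothetical; they are not the OCSF schema and not the interface of LoggingOpenShellRuntimeProbe, only an example of one JSON line per event that alerting can filter on.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("kartograph.openshell.audit")  # hypothetical name


def lifecycle_event(kind: str, sandbox_id: str, outcome: str, **detail) -> str:
    """Serialize one audit event as a single JSON line."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "event": kind,          # e.g. "policy_apply", "sandbox_start"
        "sandbox_id": sandbox_id,
        "outcome": outcome,     # "success" or "failure"
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)


def emit(kind: str, sandbox_id: str, outcome: str, **detail) -> str:
    """Log the event; failures go out at ERROR so alerts can key on level."""
    line = lifecycle_event(kind, sandbox_id, outcome, **detail)
    level = logging.INFO if outcome == "success" else logging.ERROR
    logger.log(level, line)
    return line
```

A call such as `emit("policy_apply", "sbx-1", "failure", policy="extraction-job.yaml")` would produce one ERROR-level line that an alert rule for policy compile failures could match.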

6. Defense in depth

  • Keep/adapt networkpolicy-sticky-runtime.yaml for any remaining plain-container workloads during transition.
  • Retire container backend from prod config after soak period.

7. Deprecate Docker-out-of-Docker path

After prod soak:

  • Default prod to OpenShell only; keep container/agentic_ci for local dev fallback.
  • Remove AgenticCiExtractionJobRunner host-network path from prod code paths (optional code cleanup issue).

Acceptance criteria

  1. GMA sticky session starts, passes health check, completes chat turns in prod with OpenShell backend.
  2. Extraction job runs end-to-end in OpenShell sandbox; reaches kartograph-api:8000 without host network.
  3. Sandboxed egress denied to non-allowlisted destinations (negative test).
  4. No long-lived workload JWT in container/sandbox env; runtime auth required on /v1/turn.
  5. Runbook documents gateway failure modes and a temporary rollback to the container backend.
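
Criterion 4 lends itself to an automated check run inside the sandbox. The sketch below flags environment variables whose values look like a JWT (three base64url segments, the first starting with `eyJ`). It is a heuristic under that assumption, not a guarantee that no token is present, and `jwt_like_env_vars` is a hypothetical helper.

```python
import os
import re
from typing import List, Mapping

# Compact JWTs are three dot-separated base64url segments; a JSON header
# encodes to a string beginning with "eyJ". The signature may be empty.
JWT_LIKE = re.compile(r"^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*$")


def jwt_like_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of env vars whose value looks like a JWT."""
    return sorted(
        name for name, value in env.items() if JWT_LIKE.match(value.strip())
    )
```

An acceptance test could assert that `jwt_like_env_vars()` returns an empty list when executed inside a running sticky or job sandbox.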

References

  • Code: src/api/extraction/infrastructure/openshell/
  • Dev compose: compose.dev.yaml (Track A / Track B toggle)
  • Prod stubs: deploy/apps/kartograph/base/openshell-policies-configmap.yaml, networkpolicy-sticky-runtime.yaml, configmap.yaml
  • OpenShell docs: https://docs.nvidia.com/openshell/about/installation

Out of scope (separate issues)

  • hp-fleet-gitops canonical deploy sync (the deploy/ directory in this repo holds reference templates).
  • Agent-runtime image hardening beyond current Phase 0 flags.
