Summary
Move Graph Management Assistant (GMA) sticky sessions and batch extraction jobs from Docker-out-of-Docker (docker.sock on the API, agentic_ci + --network host) to NVIDIA OpenShell sandboxes with enforced network policy, provider-managed Vertex credentials, and auditable lifecycle events.
Dev is validating Track B (KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell, job_runner=openshell) against a host-local gateway. This issue tracks production rollout.
Current state
| Layer | Dev (today) | Prod (today) |
| --- | --- | --- |
| GMA sticky runtime | OpenShell or container (compose toggle) | container (configmap default) |
| Extraction jobs | OpenShell or agentic_ci | agentic_ci + host network |
| Security Phase 0 | Runtime auth on /v1/turn, no env workload token, Docker hardening | Partial config only |
| OpenShell policies | Bundled YAML (soft enforcement) | ConfigMap at /etc/openshell/policies (hard_requirement) |
| Gateway | systemd user service on developer host | Not deployed |
Code supports both backends via ExtractionWorkloadRuntimeSettings and factory wiring (container / openshell / memory).
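
The backend selection described above can be pictured as a small registry keyed by the configured backend name. This is a minimal sketch, not the project's actual wiring: RuntimeSettings, StubRuntime, and build_runtime are hypothetical stand-ins (the real ExtractionWorkloadRuntimeSettings fields are not shown in this issue), and the job runner value "agentic_ci" is an assumed spelling. Only the two environment variable names and the three backend names come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class RuntimeSettings:
    """Hypothetical stand-in for ExtractionWorkloadRuntimeSettings."""
    backend: str = "container"      # container | openshell | memory
    job_runner: str = "agentic_ci"  # assumed value name; openshell is the target


def settings_from_env(env: Mapping[str, str]) -> RuntimeSettings:
    """Read the two toggles named in this issue; defaults mirror prod today."""
    return RuntimeSettings(
        backend=env.get("KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND", "container"),
        job_runner=env.get("KARTOGRAPH_EXTRACTION_RUNTIME_JOB_RUNNER", "agentic_ci"),
    )


class StubRuntime:
    """Placeholder runtime; a real one would start containers or sandboxes."""

    def __init__(self, name: str) -> None:
        self.name = name


# Registry of backend name -> constructor (assumed shape of the factory wiring).
BACKENDS: Dict[str, Callable[[], StubRuntime]] = {
    "container": lambda: StubRuntime("container"),
    "openshell": lambda: StubRuntime("openshell"),
    "memory": lambda: StubRuntime("memory"),
}


def build_runtime(settings: RuntimeSettings) -> StubRuntime:
    """Fail fast on unknown backends instead of silently falling back."""
    try:
        return BACKENDS[settings.backend]()
    except KeyError:
        raise ValueError(f"unknown runtime backend: {settings.backend!r}") from None
```

Rejecting unknown backend names at startup is a deliberate choice in this sketch: a typo in the configmap should surface as an error rather than quietly selecting the container path.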
Target architecture (prod)
User → Kartograph API → openshell CLI → OpenShell Gateway → sandbox (sticky or job)
↑ ↑
mTLS + policy Vertex provider (google-vertex-ai)
- No docker.sock on the API pod: the API invokes OpenShell against a cluster-managed gateway endpoint.
- Network egress allowlisted per workload mode (Kartograph workload API + inference.local / Vertex only).
- Credentials via OpenShell providers (kartograph-gma, job provider), not gcloud ADC volume mounts in sandboxes.
- Sticky runtime auth unchanged: session-bound token on /v1/turn; workload JWT per turn in request body.
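
The per-mode egress allowlist above amounts to a default-deny lookup. The sketch below only illustrates that decision logic; the real rules live in the OpenShell policy YAML and are enforced by the sandbox, not by application code. The mode names, the wildcard port for inference.local, and the extra job hosts are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Endpoint = Tuple[str, Optional[int]]  # (host, port); port None means any port

# Illustrative allowlists only; not taken from the actual policy files.
BASE: FrozenSet[Endpoint] = frozenset({
    ("kartograph-api", 8000),   # workload API, as named in this issue
    ("inference.local", None),  # inference endpoint; port is an assumption
})

ALLOWLIST: Dict[str, FrozenSet[Endpoint]] = {
    "sticky": BASE,
    # Extraction jobs may need extra dependencies (hosts here are assumptions).
    "job": BASE | {("github.com", 443), ("pypi.org", 443)},
}


def is_egress_allowed(mode: str, host: str, port: int) -> bool:
    """Default-deny: unknown modes and unlisted destinations are rejected."""
    rules = ALLOWLIST.get(mode, frozenset())
    return (host, port) in rules or (host, None) in rules
```

For example, is_egress_allowed("sticky", "github.com", 443) is False while the same call with mode "job" is True, which is the shape a negative egress test would exercise.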
Workstreams
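1. OpenShell gateway (cluster infra)
- openshell-gateway as a dedicated Deployment/Service (or approved platform pattern), not per-node systemd.
- KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_GATEWAY_URL (in-cluster DNS, mTLS).
2. Kartograph API runtime configuration
- KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=openshell
- KARTOGRAPH_EXTRACTION_RUNTIME_JOB_RUNNER=openshell
- KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_ENFORCEMENT=hard_requirement
- KARTOGRAPH_EXTRACTION_RUNTIME_OPENSHELL_POLICY_DIR=/etc/openshell/policies
- kartograph-openshell-policies ConfigMap (already in deploy/apps/kartograph/base/).
- openshell CLI in API image or sidecar: the API currently shells out to openshell; prod must not depend on host PATH.
- Remove docker.sock, gcloud config mounts, and agentic_ci host networking once OpenShell is proven.
3. Policy maturity (Phase 3)
Per-mode policies (already authored under deploy/openshell/policies/):
- gma-initial-schema-design.yaml
- gma-extraction-jobs.yaml
- gma-one-off-mutations.yaml
- extraction-job.yaml
Each should enforce:
- Egress to kartograph-api:8000 and inference endpoints only (+ job-specific deps, e.g. GitHub/PyPI for extraction jobs).
- ... (/extraction/workloads/*).
- hard_requirement Landlock in prod overlays.
4. Providers & inference
- Provider --type google-vertex-ai (not legacy google-cloud).
5. Observability & audit
- LoggingOpenShellRuntimeProbe (or successor) to structured domain logs / OCSF-aligned events for policy apply and sandbox lifecycle.
6. Defense in depth
- networkpolicy-sticky-runtime.yaml for any remaining plain-container workloads during transition.
7. Deprecate Docker-out-of-Docker path
After prod soak:
- Keep container / agentic_ci for local dev fallback.
- Remove AgenticCiExtractionJobRunner host-network path from prod code paths (optional code cleanup issue).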
Acceptance criteria
- GMA sticky session starts, passes health check, completes chat turns in prod with OpenShell backend.
- Extraction job runs end-to-end in OpenShell sandbox; reaches kartograph-api:8000 without host network.
- Sandboxed egress denied to non-allowlisted destinations (negative test).
- No long-lived workload JWT in container/sandbox env; runtime auth required on /v1/turn.
- Runbook documents gateway failure modes and rollback to container backend (temporary).
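
The criterion that no workload JWT lives in the container/sandbox environment could be spot-checked by scanning an environment mapping for JWT-shaped values. This is a heuristic sketch under stated assumptions: the environment is available as a plain mapping (for example, captured from the sandbox process), and the function name and sample variable name are illustrative rather than part of Kartograph. It detects token-shaped values only and cannot judge a token's lifetime.

```python
import re
from typing import List, Mapping

# Three base64url segments separated by dots. JWT headers are base64url-encoded
# JSON, which in practice begins with "eyJ".
JWT_LIKE = re.compile(r"^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*$")


def env_vars_with_jwt(env: Mapping[str, str]) -> List[str]:
    """Return the sorted names of variables whose value looks like a JWT."""
    return sorted(
        name for name, value in env.items() if JWT_LIKE.match(value.strip())
    )


if __name__ == "__main__":
    sample = {
        "PATH": "/usr/bin",
        "WORKLOAD_TOKEN": "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ4In0.c2ln",  # made-up token
    }
    print(env_vars_with_jwt(sample))
```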
References
- Code: src/api/extraction/infrastructure/openshell/
- Dev compose: compose.dev.yaml (Track A / Track B toggle)
- Prod stubs: deploy/apps/kartograph/base/openshell-policies-configmap.yaml, networkpolicy-sticky-runtime.yaml, configmap.yaml
- OpenShell docs: https://docs.nvidia.com/openshell/about/installation
Out of scope (separate issues)
- hp-fleet-gitops canonical deploy sync (this repo's deploy/ is reference/templates).
- Agent-runtime image hardening beyond current Phase 0 flags.