feat(sandbox): integrate OCSF structured logging for sandbox events #720

Open
johntmyers wants to merge 19 commits into main from feat/ocsf-log-integration

@johntmyers
Collaborator

Summary

Replace ad-hoc tracing log calls across the sandbox with OCSF v1.7.0 structured events using the openshell-ocsf crate (PR #489). Every network connection, process lifecycle event, filesystem policy decision, SSH authentication, and configuration change now emits a typed OCSF event with a human-readable shorthand format and optional JSONL export.

Related Issue

Builds on PR #489 (openshell-ocsf crate) and PR #474 (settings system).

Changes

Foundation

  • Register ocsf_json_enabled setting (Bool, defaults false) for JSONL export toggle
  • Replace stdout and file tracing::fmt layers with OcsfShorthandLayer
  • Add conditional OcsfJsonlLayer with Arc<AtomicBool> hot-toggle (daily rotation, 3 files)
  • Update LogPushLayer to use format_shorthand() for OCSF events and set level to OCSF
  • Create process-wide SandboxContext via OnceLock with test fallback
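
The Arc<AtomicBool> hot-toggle mentioned above could look roughly like the following. This is a minimal std-only sketch, not the PR's OcsfJsonlLayer: the `JsonlSink` type and its in-memory `lines` buffer are hypothetical stand-ins for a layer that writes to a rotated file.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a JSONL export layer: it records events
/// only while the shared flag is set.
struct JsonlSink {
    enabled: Arc<AtomicBool>,
    lines: Vec<String>,
}

impl JsonlSink {
    fn new(enabled: Arc<AtomicBool>) -> Self {
        Self { enabled, lines: Vec::new() }
    }

    /// Append one JSON line if export is currently enabled.
    /// Returns whether the line was written.
    fn on_event(&mut self, json_line: &str) -> bool {
        if !self.enabled.load(Ordering::Relaxed) {
            return false;
        }
        self.lines.push(json_line.to_string());
        true
    }
}

fn main() {
    // Starts disabled, mirroring the "defaults false" setting.
    let flag = Arc::new(AtomicBool::new(false));
    let mut sink = JsonlSink::new(Arc::clone(&flag));

    // Export is off, so the event is dropped.
    assert!(!sink.on_event(r#"{"class_uid":4001}"#));

    // A settings change flips the shared flag without rebuilding the sink.
    flag.store(true, Ordering::Relaxed);
    assert!(sink.on_event(r#"{"class_uid":4001}"#));

    println!("lines written: {}", sink.lines.len());
}
```

Sharing the flag through an Arc lets the settings handler and the logging layer agree on the current state without a lock or a subscriber rebuild.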

Event Migration (~110 call sites)

  • Network (proxy.rs, l7/, bypass_monitor.rs, mechanistic_mapper.rs): CONNECT/FORWARD allow/deny, SSRF blocks, L7 decisions, bypass detection (dual-emit with DetectionFinding), DNS failures, inference interception
  • SSH (ssh.rs): Handshake accepted/denied, nonce replay (dual-emit), direct-tcpip, listen
  • Process (lib.rs, process.rs): Launch, exit, timeout kill, SIGTERM failure
  • Filesystem (landlock.rs, sandbox/mod.rs, lib.rs): Landlock apply/unavailable, disk policy load/validate, platform sandbox warning
  • Config (lib.rs, opa.rs, netns.rs): Policy load/reload, inference routes, TLS setup, settings changes, bypass detection rules, provider env, network namespace lifecycle
  • Lifecycle (lib.rs): Supervisor start, SSH server ready/failed, poll loop exit

Shorthand Format

2026-04-01T04:04:32.118Z OCSF NET:OPEN [INFO] ALLOWED /usr/bin/curl(58) -> api.github.com:443 [policy:github_api engine:opa]
2026-04-01T04:04:32.690Z OCSF NET:OPEN [MED] DENIED /usr/bin/curl(64) -> httpbin.org:443 [policy:- engine:opa]
2026-04-01T04:04:13.058Z INFO openshell_sandbox: Starting sandbox

  • OCSF / INFO level label at column 25 for visual scanning
  • Severity as bracketed suffix after CLASS:ACTIVITY: [INFO], [MED], [HIGH], [CRIT]
  • Timestamps from the shorthand layer (UTC ISO 8601)
  • One SSH event per connection (intermediate handshake steps downgraded to debug)
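
To make the layout of the sample lines concrete, here is a rough sketch that reproduces the NET:OPEN shape shown above. The `NetEvent` struct and `render_shorthand` function are hypothetical and exist only for illustration; the crate's own `format_shorthand()` and event types are not shown in this PR description and are likely richer.

```rust
/// Severity labels in the bracketed style described in this PR.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Severity {
    Info,
    Low,
    Medium,
    High,
    Critical,
    Fatal,
}

impl Severity {
    fn label(self) -> &'static str {
        match self {
            Severity::Info => "[INFO]",
            Severity::Low => "[LOW]",
            Severity::Medium => "[MED]",
            Severity::High => "[HIGH]",
            Severity::Critical => "[CRIT]",
            Severity::Fatal => "[FATAL]",
        }
    }
}

/// Hypothetical flattened view of a network event (illustration only).
struct NetEvent<'a> {
    timestamp: &'a str,
    class_activity: &'a str,
    severity: Severity,
    action: &'a str,
    process: &'a str,
    pid: u32,
    host: &'a str,
    port: u16,
    policy: Option<&'a str>,
    engine: &'a str,
}

/// Render: timestamp, OCSF level label, CLASS:ACTIVITY, severity, action,
/// actor -> destination, then the bracketed policy/engine context.
fn render_shorthand(ev: &NetEvent) -> String {
    format!(
        "{} OCSF {} {} {} {}({}) -> {}:{} [policy:{} engine:{}]",
        ev.timestamp,
        ev.class_activity,
        ev.severity.label(),
        ev.action,
        ev.process,
        ev.pid,
        ev.host,
        ev.port,
        ev.policy.unwrap_or("-"), // "-" when no policy matched
        ev.engine
    )
}

fn main() {
    let denied = NetEvent {
        timestamp: "2026-04-01T04:04:32.690Z",
        class_activity: "NET:OPEN",
        severity: Severity::Medium,
        action: "DENIED",
        process: "/usr/bin/curl",
        pid: 64,
        host: "httpbin.org",
        port: 443,
        policy: None,
        engine: "opa",
    };
    println!("{}", render_shorthand(&denied));
}
```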

Docker

  • Add openshell-ocsf to skeleton, stub, and source COPY stages in all Dockerfiles
  • Touch openshell-ocsf/src/lib.rs in supervisor-workspace to invalidate cargo cache

Docs

  • New Observability section in user-facing docs: Logging, Accessing Logs, OCSF JSON Export
  • OCSF logging guidance added to AGENTS.md (decision framework, class selection, severity, builder examples)

Smoke Tests

  • Attach provider to all phases to avoid GitHub API rate limiting
  • Add auth headers for credential injection in L4/L7 test phases
  • Accept 200 or 403 for tls:skip raw tunnel test

Testing

  • mise run pre-commit passes
  • 110 openshell-ocsf tests pass
  • 397 openshell-sandbox tests pass
  • Smoke tests pass (5/5 after rate limit fix)
  • Deployed and verified shorthand format in live sandbox logs
  • Verified OCSF JSONL output when ocsf_json_enabled is toggled on
  • Principal engineer review completed and all warnings addressed
  • E2E tests (require cluster)

Checklist

  • Follows Conventional Commits
  • Architecture docs updated (AGENTS.md OCSF guidance)
  • User-facing docs updated (docs/observability/)
  • No secrets logged (query params excluded, redacted targets used)
  • Docker build contexts updated for all Dockerfiles
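
The "query params excluded" item in the checklist can be illustrated with a small helper. This is only a sketch of the idea under the assumption that the target is a plain path or URL string; `redact_target` is a hypothetical name, not a function from the sandbox or openshell-ocsf crates.

```rust
/// Drop the query string and fragment from a request target so that
/// tokens passed as query parameters never reach a log line.
fn redact_target(target: &str) -> &str {
    let end = target
        .find(|c| c == '?' || c == '#')
        .unwrap_or(target.len());
    &target[..end]
}

fn main() {
    let raw = "/repos/org/repo/issues?access_token=secret&page=2";
    // Only the path survives; the query string is discarded.
    println!("{}", redact_target(raw));
}
```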

WIP: Replace ad-hoc tracing calls with OCSF event builders across all
sandbox subsystems (network, SSH, process, filesystem, config, lifecycle).

- Register ocsf_logging_enabled setting (defaults false)
- Replace stdout/file fmt layers with OcsfShorthandLayer
- Add conditional OcsfJsonlLayer for /var/log/openshell-ocsf.log
- Update LogPushLayer to extract OCSF shorthand for gRPC push
- Migrate ~106 log sites to OCSF builders (NetworkActivity, HttpActivity,
  SshActivity, ProcessActivity, DetectionFinding, ConfigStateChange,
  AppLifecycle)
- Add openshell-ocsf to all Docker build contexts

…limits

GitHub's unauthenticated API rate limit (60/hour) causes flaky 403s for
Phases 1, 2, and 4. Fix by attaching the provider to all sandboxes and
upgrading the Phase 1 policy to L7 so credential injection works.

Phase 4 (tls:skip) cannot inject credentials by design, so relax the
assertion to accept either 200 or 403 from upstream -- both prove the
proxy forwarded the request.

…estamp

The display layer (gateway logs, TUI, sandbox logs CLI) already prepends
a timestamp. Having one in the shorthand output too produces redundant
double-timestamps like:

  15:49:11 sandbox INFO  15:49:11.649 I NET:OPEN ALLOWED ...

Now the shorthand is just the severity + structured content:

  15:49:11 sandbox INFO  I NET:OPEN ALLOWED ...

Replace cryptic single-character severity codes (I/L/M/H/C/F) with
readable bracketed labels: [LOW], [MED], [HIGH], [CRIT], [FATAL].

Informational severity (the happy-path default) is omitted entirely to
keep normal log output clean and avoid redundancy with the tracing-level
INFO that the display layer already provides.

Before: sandbox INFO  I NET:OPEN ALLOWED ...
After:  sandbox INFO  NET:OPEN ALLOWED ...

Before: sandbox INFO  M NET:OPEN DENIED ...
After:  sandbox INFO  [MED] NET:OPEN DENIED ...

Set the level field to 'OCSF' instead of 'INFO' for OCSF events in the
gRPC log push. This visually distinguishes structured OCSF events from
plain tracing output in the TUI and CLI sandbox logs:

  sandbox OCSF  NET:OPEN [INFO] ALLOWED python3(42) -> api.example.com:443
  sandbox OCSF  NET:OPEN [MED] DENIED python3(42) -> blocked.com:443
  sandbox INFO  Fetching sandbox policy via gRPC

PR #677 added a warn!() for inaccessible Landlock paths in best-effort
mode. Convert to ConfigStateChangeBuilder with degraded state so it
flows through the OCSF shorthand format consistently.

Match the main openshell.log rotation mechanics (daily, 3 files max)
instead of a single unbounded append-only file. Prevents disk exhaustion
when ocsf_logging_enabled is left on in long-running sandboxes.
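
The retention half of "daily, 3 files max" can be sketched as a pure function. This assumes rotated files share one prefix and carry a `.YYYY-MM-DD` suffix, which is an assumption about naming rather than something this PR states; `files_to_prune` is a hypothetical helper, and the real rotation may be handled entirely by the logging library.

```rust
/// Given rotated log file names that share a prefix and end in a
/// `.YYYY-MM-DD` suffix, return the names that should be deleted so that
/// only the `keep` newest files remain.
fn files_to_prune(mut names: Vec<String>, keep: usize) -> Vec<String> {
    // ISO dates sort lexicographically in chronological order,
    // so after sorting the oldest files come first.
    names.sort();
    let excess = names.len().saturating_sub(keep);
    names.truncate(excess);
    names
}

fn main() {
    let names = vec![
        "openshell-ocsf.log.2026-04-03".to_string(),
        "openshell-ocsf.log.2026-03-30".to_string(),
        "openshell-ocsf.log.2026-04-01".to_string(),
        "openshell-ocsf.log.2026-03-31".to_string(),
        "openshell-ocsf.log.2026-04-02".to_string(),
    ];
    for name in files_to_prune(names, 3) {
        println!("would delete {name}");
    }
}
```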
W1: Remove redundant 'OCSF' prefix from shorthand file layer — the
    class name (NET:OPEN, HTTP:GET) already identifies structured events
    and the LogPushLayer separately sets the level field.

W2: Log a debug message when OCSF_CTX.set() is called a second time
    instead of silently discarding via let _.

W3: Document the boundary between OCSF-migrated events and intentionally
    plain tracing calls (DEBUG/TRACE, transient, internal plumbing).

W4: Migrate remaining iptables LOG rule failure warnings in netns.rs
    (IPv4 TCP/UDP, IPv6 TCP/UDP) to ConfigStateChangeBuilder for
    consistency with the IPv4 bypass rule failure already migrated.

W5: Migrate malformed inference request warn to NetworkActivity with
    ActivityId::Refuse and SeverityId::Medium.

W6: Use Medium severity for L7 deny decisions (both CONNECT tunnel and
    FORWARD proxy paths) to match the CONNECT deny severity pattern.
    Allows and audits remain Informational.
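
The W2 change (report a repeated OCSF_CTX.set() instead of discarding the result) together with the test fallback could be sketched as follows. The `SandboxContext` field, the helper names, and the use of eprintln! in place of a debug!() call are assumptions made to keep the example std-only.

```rust
use std::sync::OnceLock;

/// Hypothetical minimal context; the real SandboxContext likely has more fields.
#[derive(Debug, Clone, PartialEq)]
struct SandboxContext {
    sandbox_name: String,
}

static OCSF_CTX: OnceLock<SandboxContext> = OnceLock::new();

/// Install the context once. A second call is reported rather than being
/// silently dropped with `let _ =`. Returns true if this call installed it.
fn init_context(cell: &OnceLock<SandboxContext>, ctx: SandboxContext) -> bool {
    match cell.set(ctx) {
        Ok(()) => true,
        Err(rejected) => {
            // Stand-in for a debug-level trace in the real code.
            eprintln!("debug: OCSF context already set; ignoring {rejected:?}");
            false
        }
    }
}

/// Read the context, falling back to a placeholder when nothing was
/// installed (for example in unit tests).
fn context(cell: &OnceLock<SandboxContext>) -> SandboxContext {
    cell.get().cloned().unwrap_or_else(|| SandboxContext {
        sandbox_name: "test-sandbox".to_string(),
    })
}

fn main() {
    let first = SandboxContext { sandbox_name: "sb-1".to_string() };
    let second = SandboxContext { sandbox_name: "sb-2".to_string() };

    assert!(init_context(&OCSF_CTX, first));
    assert!(!init_context(&OCSF_CTX, second)); // logged, not applied
    println!("active context: {:?}", context(&OCSF_CTX));
}
```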
The shorthand logs are already OCSF-structured events. The setting
specifically controls the JSONL file export, so the name should reflect
that: ocsf_json_enabled.

The OcsfShorthandLayer writes directly to the log file with no outer
display layer to supply timestamps. Add a UTC timestamp prefix to every
line so the file output matches what tracing::fmt used to provide.

Before: CONFIG:VALIDATED [INFO] Validated 'sandbox' user exists in image
After:  2026-04-01T15:49:11.649Z CONFIG:VALIDATED [INFO] Validated ...
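
A std-only sketch of producing the UTC ISO 8601 prefix shown in the "After" line. The PR does not say how the layer formats time (a date/time crate is the more likely choice), so the function names here are hypothetical; the day-to-date conversion uses the well-known civil-from-days arithmetic for the proleptic Gregorian calendar.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Convert days since 1970-01-01 into (year, month, day).
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Format a SystemTime as `YYYY-MM-DDTHH:MM:SS.mmmZ` (UTC).
/// Times before the Unix epoch are clamped to the epoch.
fn utc_timestamp(t: SystemTime) -> String {
    let since_epoch = t.duration_since(UNIX_EPOCH).unwrap_or(Duration::ZERO);
    let secs = since_epoch.as_secs() as i64;
    let millis = since_epoch.subsec_millis();
    let (year, month, day) = civil_from_days(secs.div_euclid(86_400));
    let sod = secs.rem_euclid(86_400); // second of day
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}.{:03}Z",
        year,
        month,
        day,
        sod / 3_600,
        sod % 3_600 / 60,
        sod % 60,
        millis
    )
}

/// Prepend the timestamp to an already-rendered shorthand line.
fn prefix_line(t: SystemTime, line: &str) -> String {
    format!("{} {}", utc_timestamp(t), line)
}

fn main() {
    println!("{}", prefix_line(SystemTime::now(), "CONFIG:VALIDATED [INFO] Validated ..."));
}
```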
The supervisor-workspace stage touches sandbox and core sources to force
recompilation over the rust-deps dummy stubs, but openshell-ocsf was
missing. This caused the Docker cargo cache to use stale ocsf objects
from the deps stage, preventing changes to the ocsf crate (like the
timestamp fix) from appearing in the final binary.

Also adds a shorthand layer test verifying timestamp output, and drafts
the observability docs section.

Without a level prefix, OCSF events in the log file have no visual
anchor at the position where standard tracing lines show INFO/WARN.
This makes scanning the file harder since the eye has nothing consistent
to lock onto after the timestamp.

Before: 2026-04-01T04:04:13.065Z CONFIG:DISCOVERY [INFO] ...
After:  2026-04-01T04:04:13.065Z OCSF CONFIG:DISCOVERY [INFO] ...

- Fix double space in NET:LISTEN, SSH:LISTEN, and other events where
  action is empty (e.g., 'NET:LISTEN [INFO]  10.200.0.1' -> 'NET:LISTEN [INFO] 10.200.0.1')
- Add listen address to SSH:LISTEN event (was empty)
- Downgrade SSH handshake intermediate steps (reading preface, verifying)
  from OCSF events to debug!() traces. Only the final verdict
  (accepted/denied) is an OCSF event now, reducing noise from 3 events
  to 1 per SSH connection.
- Apply same spacing fix to HTTP shorthand for consistency.
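
One way to avoid the double space when the action is empty is to join only the non-empty fields. This is a minimal sketch of that idea; `join_fields` is a hypothetical helper, and the actual fix in the shorthand formatter may be implemented differently.

```rust
/// Join shorthand fields with single spaces, skipping empty ones so a
/// missing action does not leave a double space behind.
fn join_fields(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|f| f.trim())
        .filter(|f| !f.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // The empty third field is the missing action on a listen event.
    println!("{}", join_fields(&["NET:LISTEN", "[INFO]", "", "10.200.0.1"]));
}
```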
…fixes

Align doc examples with the deployed output:
- Add OCSF level prefix to all shorthand examples in the log file
- Show mixed OCSF + standard tracing in the file format section
- Update listen events (no double space, SSH includes address)
- Show one SSH:OPEN per connection instead of three
- Update grep patterns to use 'OCSF NET:' etc.

Add a Sandbox Logging (OCSF) section to AGENTS.md so agents have
in-context guidance for deciding whether new log emissions should use
OCSF structured logging or plain tracing. Covers event class selection,
severity guidelines, builder API usage, dual-emit pattern for security
findings, and the no-secrets rule.

Also adds openshell-ocsf to the Architecture Overview table.

These files were already merged to main in separate PRs. They got
pulled into our branch during rebase conflict resolution for the
deleted docs-preview-pr.yml file.

@johntmyers requested a review from a team as a code owner on April 1, 2026, 04:46
@johntmyers self-assigned this on Apr 1, 2026
@github-actions

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/OpenShell/pr-preview/pr-720/

Built to branch gh-pages at 2026-04-01 05:35 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@johntmyers added the test:e2e label (Requires end-to-end coverage) on Apr 1, 2026

Users access sandboxes via 'openshell sandbox connect', not direct SSH.
The settings CLI requires --key and --value named flags, not positional
arguments. Also fix the per-sandbox form: the sandbox name is a
positional argument, not a --sandbox flag.

The E2E tests asserted on the old tracing::fmt key=value format
(action=allow, l7_decision=audit, FORWARD, L7_REQUEST, always-blocked).
Update to match the new OCSF shorthand (ALLOWED/DENIED, HTTP:, NET:,
engine:ssrf, policy:).