Select the agent container for Kubernetes logs #131

Open
mfreeman451 wants to merge 2 commits into GoogleCloudPlatform:main from mfreeman451:fix/k8s-agent-logs-container-selection
Conversation

@mfreeman451 (Contributor)

Summary

  • pick the main agent container when fetching logs from multi-container Kubernetes pods
  • avoid log retrieval failures that require an explicit container name
  • add focused container selection coverage

Testing

  • go test ./pkg/runtime -run 'TestSelectLogContainer$' -count=1

@mfreeman451 (Contributor, Author) left a comment

lgtm

@mfreeman451 marked this pull request as ready for review on April 10, 2026 17:45
@mfreeman451 force-pushed the fix/k8s-agent-logs-container-selection branch 2 times, most recently from 4808cf5 to 6501dd5 on April 11, 2026 06:24
@ptone (Member)

ptone commented Apr 11, 2026

PR #131 Review: Select the agent container for Kubernetes logs

Reviewer: Senior Staff SWE / Security Researcher
Date: 2026-04-11
Branch: fix/k8s-agent-logs-container-selection
Files Changed: 17 | +460 / -61


Executive Summary

This PR bundles three logically distinct changes: (1) K8s multi-container log selection targeting the "agent" container, (2) a new exec API endpoint exposed through hub and broker, and (3) a mechanical refactor of string-literal action names into shared constants. The K8s log selection and constants refactor are low-risk, but the new exec endpoint introduces a significant attack surface that warrants careful scrutiny around input validation and timeout handling.


Critical Issues

1. handleAgentExec accepts but never forwards the timeout parameter

File: pkg/hub/handlers.go (new function handleAgentExec, ~line 1702)
File: pkg/hub/broker_http_transport.go (function ExecAgent, ~line 74)

The hub handler parses a Timeout field from the request body and passes it through the full dispatch chain (DispatchAgentExec -> ExecAgent). The broker HTTP transport marshals it into the JSON body sent to the runtime broker. However, the runtime broker's execCommand handler (pkg/runtimebroker/handlers.go:1268) parses ExecRequest.Timeout but never applies it — the rt.Exec(ctx, id, req.Command) call has no context deadline or timeout enforcement.

This means a malicious or buggy command (e.g., ["sleep", "999999"]) will block the HTTP request/goroutine indefinitely. The timeout field gives callers a false sense of safety.

Suggested Fix (runtime broker side — existing code, but relevant to this PR's contract):

// In pkg/runtimebroker/handlers.go, execCommand:
if req.Timeout > 0 {
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, time.Duration(req.Timeout)*time.Second)
    defer cancel()
}
output, err := rt.Exec(ctx, id, req.Command)

Note: This is partially pre-existing, but the PR is the first place the timeout is propagated from the hub layer. The hub handler should at minimum set a context deadline locally as a safety net.

2. exitCode is always hardcoded to 0

File: pkg/hub/handlers.go, handleAgentExec response (line ~1738)

writeJSON(w, http.StatusOK, map[string]interface{}{
    "output":   output,
    "exitCode": 0,
})

The response always returns exitCode: 0, even when the underlying command failed with a non-zero exit status (its stderr output is still captured). The runtime broker's execCommand has the same issue (noted with a // TODO there). For a new API contract introduced in this PR, returning a misleading exitCode: 0 on every response is incorrect and will confuse consumers.

Suggested Fix: Either omit the exitCode field until it is implemented, or propagate the actual exit code from the runtime. At minimum, document that it's not yet reliable.


Observations

3. No input bounds on command arguments

File: pkg/hub/handlers.go, handleAgentExec

The handler validates len(req.Command) == 0 but imposes no upper bound on the number of arguments or the size of individual strings. A request with thousands of arguments or multi-megabyte strings will be marshaled, transmitted, and passed to the container runtime.

Suggestion: Add reasonable bounds:

if len(req.Command) > 100 {
    ValidationError(w, "command has too many arguments", nil)
    return
}
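A fuller validation sketch, bounding argument count, per-argument length, and total payload size. The limit values and the `validateExecCommand` helper are illustrative assumptions; the real limits would be a project decision.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical limits; the actual values are a project decision.
const (
	maxExecArgs   = 100       // maximum number of arguments
	maxExecArgLen = 4096      // maximum bytes per argument
	maxExecTotal  = 64 << 10  // maximum total command payload
)

// validateExecCommand applies upper bounds before the command is
// marshaled and forwarded to the runtime broker.
func validateExecCommand(cmd []string) error {
	if len(cmd) == 0 {
		return errors.New("command must not be empty")
	}
	if len(cmd) > maxExecArgs {
		return errors.New("command has too many arguments")
	}
	total := 0
	for _, arg := range cmd {
		if len(arg) > maxExecArgLen {
			return errors.New("command argument too long")
		}
		total += len(arg)
	}
	if total > maxExecTotal {
		return errors.New("command payload too large")
	}
	return nil
}

func main() {
	fmt.Println(validateExecCommand([]string{"ls", "-l"})) // <nil>
}
```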

4. PR scope is broader than the title suggests

The PR title is "Select the agent container for Kubernetes logs," but the majority of the diff (+400 lines) is the new exec endpoint plumbing and the action-constants refactor. Consider splitting into separate PRs for cleaner review and easier reverts:

  • fix/k8s-log-container-selection (the actual bug fix, ~30 lines)
  • refactor/agent-action-constants (mechanical, ~100 lines)
  • feat/agent-exec-endpoint (new feature, ~300 lines)

5. selectLogContainer uses a hardcoded container name

File: pkg/runtime/k8s_runtime.go, selectLogContainer

if container.Name == "agent" {
    return container.Name
}

The magic string "agent" is embedded without a constant or configuration option. If the container naming convention ever changes, this silently falls back to the first container. This is acceptable for now but should have a comment explaining the convention.
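One way to address this is a named constant carrying the documented convention, as in the sketch below. The `container` type here is a minimal stand-in for the Kubernetes container struct, and the helper shape is illustrative rather than the PR's exact code.

```go
package main

import "fmt"

// agentContainerName documents the naming convention: hosted pods may
// include sidecars, but the interactive process runs in the container
// named "agent", so log retrieval prefers that container when present.
const agentContainerName = "agent"

// container is a minimal stand-in for the Kubernetes container type.
type container struct{ Name string }

// selectLogContainer prefers the agent container and falls back to
// the first container in the pod spec.
func selectLogContainer(containers []container) string {
	if len(containers) == 0 {
		return ""
	}
	for _, c := range containers {
		if c.Name == agentContainerName {
			return c.Name
		}
	}
	return containers[0].Name
}

func main() {
	pod := []container{{Name: "istio-proxy"}, {Name: "agent"}}
	fmt.Println(selectLogContainer(pod)) // agent
}
```

With the constant in place, a future change to the convention is a one-line edit rather than a grep for a magic string.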

6. handleAgentExec duplicates the broker-assignment check

File: pkg/hub/handlers.go, handleAgentExec

The function checks agent.RuntimeBrokerID == "" inline, while the DispatchAgentExec implementation also calls requireRuntimeBrokerAssigned(agent). The duplication is harmless (defense-in-depth) but the error messages differ — the handler returns a ServiceNotReady HTTP response while the dispatcher returns a generic error. The handler's early check makes the dispatcher check unreachable for this code path.

7. No audit logging for exec operations

The codebase logs audit events for destructive operations (delete, link, unlink). Command execution is equally sensitive — the exec endpoint lets authenticated users run arbitrary commands inside agent containers. There should be an audit log entry capturing the caller identity, target agent, and command.


Positive Feedback

  • selectLogContainer is clean and well-tested. The function handles nil pod, single-container, multi-container with "agent", and fallback cases. The test table is thorough and covers edge cases including an empty pod. This is exactly the right fix for the K8s log retrieval issue.

  • Shared action constants (pkg/api/agent_actions.go) are a good refactor. Centralizing the string literals eliminates typo risk and makes action routing greppable across hub and runtime broker packages.

  • requireRuntimeBrokerAssigned consolidation is well done. Replacing 10+ copies of the same if agent.RuntimeBrokerID == "" check with a single sentinel error is clean idiomatic Go.

  • Test coverage for the new exec endpoint is solid. Both the direct agent route and the grove-scoped route are tested, and the mock dispatcher pattern follows established conventions.

  • The HybridBrokerClient.ExecAgent correctly prefers the control channel, consistent with all other dispatch methods in that struct.


Final Verdict

Needs Minor Rework.

The K8s container selection fix and the constants refactor are ready to merge. The exec endpoint is functionally complete and well-structured, but has two issues that should be addressed before merge:

  1. The timeout parameter is accepted but never enforced — this is a correctness bug and a potential resource leak. At minimum, the hub handler should set a context deadline.
  2. The hardcoded exitCode: 0 is a misleading API contract for a newly introduced endpoint.

Neither issue is a security blocker (the exec endpoint is already gated by authentication + authorization + grove isolation), but they represent incorrect behavior in a new public API surface that will be harder to fix once consumers depend on the current contract.

@ptone (Member)

ptone commented Apr 11, 2026

The exec API may need a bit more thought and/or some design doc to capture a decision - can we get an issue open for that?

@mfreeman451 force-pushed the fix/k8s-agent-logs-container-selection branch from 6501dd5 to fdd8059 on April 11, 2026 23:21
@mfreeman451 (Contributor, Author)

This branch has also been rebuilt and narrowed on top of current main. The earlier review text about the broader exec/API diff no longer matches the current branch.

Current branch scope is now only:

  • pkg/runtime/k8s_runtime.go
  • pkg/runtime/k8s_runtime_test.go

So the surviving change here is the Kubernetes log-container selection fix. I also added an explicit comment documenting the existing container naming convention: hosted pods may include sidecars, but the interactive Scion process runs in the container named agent, so logs prefer that container when present.

Validation:

  • go test ./pkg/runtime -run '^TestSelectLogContainer$' -count=1

Current head: 93454d38
