Select the agent container for Kubernetes logs #131

Open
mfreeman451 wants to merge 2 commits into GoogleCloudPlatform:main from mfreeman451:fix/k8s-agent-logs-container-selection
Conversation

@mfreeman451 (Contributor)

Summary

  • pick the main agent container when fetching logs from multi-container Kubernetes pods
  • avoid log retrieval failures that require an explicit container name
  • add focused container selection coverage

Testing

  • go test ./pkg/runtime -run 'TestSelectLogContainer$' -count=1

@mfreeman451 (Contributor, Author) left a comment

lgtm

@mfreeman451 marked this pull request as ready for review on April 10, 2026 17:45
@mfreeman451 force-pushed the fix/k8s-agent-logs-container-selection branch 2 times, most recently from 4808cf5 to 6501dd5 on April 11, 2026 06:24
@ptone (Member)

ptone commented Apr 11, 2026

PR #131 Review: Select the agent container for Kubernetes logs

Reviewer: Senior Staff SWE / Security Researcher
Date: 2026-04-11
Branch: fix/k8s-agent-logs-container-selection
Files Changed: 17 | +460 / -61


Executive Summary

This PR bundles three logically distinct changes: (1) K8s multi-container log selection targeting the "agent" container, (2) a new exec API endpoint exposed through hub and broker, and (3) a mechanical refactor of string-literal action names into shared constants. The K8s log selection and constants refactor are low-risk, but the new exec endpoint introduces a significant attack surface that warrants careful scrutiny around input validation and timeout handling.


Critical Issues

1. handleAgentExec accepts but never forwards the timeout parameter

File: pkg/hub/handlers.go (new function handleAgentExec, ~line 1702)
File: pkg/hub/broker_http_transport.go (function ExecAgent, ~line 74)

The hub handler parses a Timeout field from the request body and passes it through the full dispatch chain (DispatchAgentExec -> ExecAgent). The broker HTTP transport marshals it into the JSON body sent to the runtime broker. However, the runtime broker's execCommand handler (pkg/runtimebroker/handlers.go:1268) parses ExecRequest.Timeout but never applies it — the rt.Exec(ctx, id, req.Command) call has no context deadline or timeout enforcement.

This means a malicious or buggy command (e.g., ["sleep", "999999"]) will block the HTTP request/goroutine indefinitely. The timeout field gives callers a false sense of safety.

Suggested Fix (runtime broker side — existing code, but relevant to this PR's contract):

// In pkg/runtimebroker/handlers.go, execCommand:
if req.Timeout > 0 {
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(ctx, time.Duration(req.Timeout)*time.Second)
    defer cancel()
}
output, err := rt.Exec(ctx, id, req.Command)

Note: This is partially pre-existing, but the PR is the first place the timeout is propagated from the hub layer. The hub handler should at minimum set a context deadline locally as a safety net.

2. exitCode is always hardcoded to 0

File: pkg/hub/handlers.go, handleAgentExec response (line ~1738)

writeJSON(w, http.StatusOK, map[string]interface{}{
    "output":   output,
    "exitCode": 0,
})

The response always returns exitCode: 0, even when the underlying command failed with a non-zero exit status (its stderr output is still captured). The runtime broker's execCommand has the same issue (noted with a // TODO there). For a new API contract introduced in this PR, returning a misleading exitCode: 0 on every response is incorrect and will confuse consumers.

Suggested Fix: Either omit the exitCode field until it is implemented, or propagate the actual exit code from the runtime. At minimum, document that it's not yet reliable.


Observations

3. No input bounds on command arguments

File: pkg/hub/handlers.go, handleAgentExec

The handler validates len(req.Command) == 0 but imposes no upper bound on the number of arguments or the size of individual strings. A request with thousands of arguments or multi-megabyte strings will be marshaled, transmitted, and passed to the container runtime.

Suggestion: Add reasonable bounds:

if len(req.Command) > 100 {
    ValidationError(w, "command has too many arguments", nil)
    return
}
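A fuller validation sketch, bounding argument count, per-argument length, and total payload size. The limit values and the `validateExecCommand` helper are illustrative assumptions; the real limits would be a project decision.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical limits; the actual values are a project decision.
const (
	maxExecArgs   = 100       // maximum number of arguments
	maxExecArgLen = 4096      // maximum bytes per argument
	maxExecTotal  = 64 << 10  // maximum total command payload
)

// validateExecCommand applies upper bounds before the command is
// marshaled and forwarded to the runtime broker.
func validateExecCommand(cmd []string) error {
	if len(cmd) == 0 {
		return errors.New("command must not be empty")
	}
	if len(cmd) > maxExecArgs {
		return errors.New("command has too many arguments")
	}
	total := 0
	for _, arg := range cmd {
		if len(arg) > maxExecArgLen {
			return errors.New("command argument too long")
		}
		total += len(arg)
	}
	if total > maxExecTotal {
		return errors.New("command payload too large")
	}
	return nil
}

func main() {
	fmt.Println(validateExecCommand([]string{"ls", "-l"})) // <nil>
}
```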

4. PR scope is broader than the title suggests

The PR title is "Select the agent container for Kubernetes logs," but the majority of the diff (+400 lines) is the new exec endpoint plumbing and the action-constants refactor. Consider splitting into separate PRs for cleaner review and easier reverts:

  • fix/k8s-log-container-selection (the actual bug fix, ~30 lines)
  • refactor/agent-action-constants (mechanical, ~100 lines)
  • feat/agent-exec-endpoint (new feature, ~300 lines)

5. selectLogContainer uses a hardcoded container name

File: pkg/runtime/k8s_runtime.go, selectLogContainer

if container.Name == "agent" {
    return container.Name
}

The magic string "agent" is embedded without a constant or configuration option. If the container naming convention ever changes, this silently falls back to the first container. This is acceptable for now but should have a comment explaining the convention.
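One way to address this is a named constant carrying the documented convention, as in the sketch below. The `container` type here is a minimal stand-in for the Kubernetes container struct, and the helper shape is illustrative rather than the PR's exact code.

```go
package main

import "fmt"

// agentContainerName documents the naming convention: hosted pods may
// include sidecars, but the interactive process runs in the container
// named "agent", so log retrieval prefers that container when present.
const agentContainerName = "agent"

// container is a minimal stand-in for the Kubernetes container type.
type container struct{ Name string }

// selectLogContainer prefers the agent container and falls back to
// the first container in the pod spec.
func selectLogContainer(containers []container) string {
	if len(containers) == 0 {
		return ""
	}
	for _, c := range containers {
		if c.Name == agentContainerName {
			return c.Name
		}
	}
	return containers[0].Name
}

func main() {
	pod := []container{{Name: "istio-proxy"}, {Name: "agent"}}
	fmt.Println(selectLogContainer(pod)) // agent
}
```

With the constant in place, a future change to the convention is a one-line edit rather than a grep for a magic string.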

6. handleAgentExec duplicates the broker-assignment check

File: pkg/hub/handlers.go, handleAgentExec

The function checks agent.RuntimeBrokerID == "" inline, while the DispatchAgentExec implementation also calls requireRuntimeBrokerAssigned(agent). The duplication is harmless (defense-in-depth) but the error messages differ — the handler returns a ServiceNotReady HTTP response while the dispatcher returns a generic error. The handler's early check makes the dispatcher check unreachable for this code path.

7. No audit logging for exec operations

The codebase logs audit events for destructive operations (delete, link, unlink). Command execution is equally sensitive — the exec endpoint lets authenticated users run arbitrary commands inside agent containers. There should be an audit log entry capturing the caller identity, target agent, and command.


Positive Feedback

  • selectLogContainer is clean and well-tested. The function handles nil pod, single-container, multi-container with "agent", and fallback cases. The test table is thorough and covers edge cases including an empty pod. This is exactly the right fix for the K8s log retrieval issue.

  • Shared action constants (pkg/api/agent_actions.go) are a good refactor. Centralizing the string literals eliminates typo risk and makes action routing greppable across hub and runtime broker packages.

  • requireRuntimeBrokerAssigned consolidation is well done. Replacing 10+ copies of the same if agent.RuntimeBrokerID == "" check with a single sentinel error is clean idiomatic Go.

  • Test coverage for the new exec endpoint is solid. Both the direct agent route and the grove-scoped route are tested, and the mock dispatcher pattern follows established conventions.

  • The HybridBrokerClient.ExecAgent correctly prefers the control channel, consistent with all other dispatch methods in that struct.


Final Verdict

Needs Minor Rework.

The K8s container selection fix and the constants refactor are ready to merge. The exec endpoint is functionally complete and well-structured, but has two issues that should be addressed before merge:

  1. The timeout parameter is accepted but never enforced — this is a correctness bug and a potential resource leak. At minimum, the hub handler should set a context deadline.
  2. The hardcoded exitCode: 0 is a misleading API contract for a newly introduced endpoint.

Neither issue is a security blocker (the exec endpoint is already gated by authentication + authorization + grove isolation), but they represent incorrect behavior in a new public API surface that will be harder to fix once consumers depend on the current contract.

@ptone (Member)

ptone commented Apr 11, 2026

The exec API may need a bit more thought and/or some design doc to capture a decision - can we get an issue open for that?

@mfreeman451 force-pushed the fix/k8s-agent-logs-container-selection branch from 6501dd5 to fdd8059 on April 11, 2026 23:21
@mfreeman451 (Contributor, Author)

This branch has also been rebuilt and narrowed on top of current main. The earlier review text about the broader exec/API diff no longer matches the current branch.

Current branch scope is now only:

  • pkg/runtime/k8s_runtime.go
  • pkg/runtime/k8s_runtime_test.go

So the surviving change here is the Kubernetes log-container selection fix. I also added an explicit comment documenting the existing container naming convention: hosted pods may include sidecars, but the interactive Scion process runs in the container named agent, so logs prefer that container when present.

Validation:

  • go test ./pkg/runtime -run '^TestSelectLogContainer$' -count=1

Current head: 93454d38
