Skip to content

fix: use podman unshare to clean up rootless home directories#107

Merged
ptone merged 2 commits intoGoogleCloudPlatform:mainfrom
Empiria:fix/podman-rootless-home-cleanup
Apr 12, 2026
Merged

fix: use podman unshare to clean up rootless home directories#107
ptone merged 2 commits intoGoogleCloudPlatform:mainfrom
Empiria:fix/podman-rootless-home-cleanup

Conversation

@meatballs
Copy link
Copy Markdown
Contributor

Summary

Fixes #105 — Agent deletion fails to clean up home directory in podman rootless mode.

In podman rootless mode, files created as root inside the container are owned by a mapped subuid on the host. DeleteAgentFiles() calls RemoveAllSafe() on the external home directory, but this fails silently because the host user can't remove mapped-UID files.

Fix: When RemoveAllSafe() fails, fall back to podman unshare rm -rf <path>. This enters the user namespace where the mapped UIDs are accessible, allowing cleanup without sudo.

This is the Podman equivalent of the chown -R pattern already used in the K8s runtime (pkg/runtime/k8s_runtime.go:302-308).

Test plan

  • go build / go vet clean
  • Delete agent in podman rootless mode — verify home directory is cleaned up
  • Delete agent on Docker or podman rootful — verify standard removal still works (podman unshare not called)
  • Verify podman unshare failure is non-fatal (graceful degradation if podman not available)

In podman rootless mode, files created as root inside the container are
owned by a mapped subuid on the host. When an agent is deleted,
RemoveAllSafe() fails silently because the host user can't remove these
files, leaving stale directories in ~/.scion/grove-configs/*/home/.

Fall back to `podman unshare rm -rf` when standard removal fails. This
enters the user namespace where the mapped UIDs are accessible, allowing
cleanup without requiring sudo.
@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 11, 2026

PR #107 Review: fix: use podman unshare to clean up rootless home directories

Executive Summary

This PR makes two changes: (1) adds a podman unshare rm -rf fallback for cleaning up external agent home directories that fail standard removal due to podman rootless UID mapping, and (2) simplifies GetAgent by removing worktree recreation logic and relocating the stale-directory cleanup. The primary fix is low-risk and well-scoped; the GetAgent refactor removes defensive behavior that may have edge-case implications.


Critical Issues

1. Command Injection Risk via externalAgentDirLow severity (mitigated)

exec.Command("podman", "unshare", "rm", "-rf", externalAgentDir).Run()

exec.Command passes arguments as a list (not through a shell), so shell injection via metacharacters in externalAgentDir is not possible. Additionally, externalAgentDir is constructed from GetGitGroveExternalAgentsDir(projectDir) + agentName, where the external dir path comes from on-disk grove config and agentName is a function parameter (typically slugified). No action required, but worth noting for future maintainers that this path should never be user-controlled raw input passed to a shell.

2. Stale Directory Cleanup Lost Worktree Pruning — Medium severity

The old code in GetAgent pruned worktrees after removing a stale directory:

// OLD
os.RemoveAll(agentDir)
if root, rootErr := util.RepoRootDir(filepath.Dir(agentWorkspace)); rootErr == nil {
    _ = util.PruneWorktreesIn(root)
}

The new code removes the directory but does not prune worktrees:

// NEW
os.RemoveAll(agentDir)
// (no prune)

When a stale agent directory contained a worktree pointer (.git file), removing the directory without pruning leaves an orphaned worktree record in .git/worktrees/. The subsequent ProvisionAgent call does prune worktrees before creating a new one (line 376-380), so this is likely safe in practice, but the pruning now happens later in the flow rather than immediately after the stale directory removal. If ProvisionAgent fails before reaching the prune step, the orphaned record persists.

Suggested fix: Add the prune call back after os.RemoveAll(agentDir) in the stale directory block, or document why it's intentionally deferred.


Observations

3. Workspace Recreation Logic Removed — Behavioral Change

The old GetAgent had ~30 lines of logic to recreate missing worktrees for existing, fully-provisioned agents. The new code simply sets agentWorkspace = "" if the workspace doesn't exist. This is a significant behavioral change — previously, a missing workspace would be automatically recreated; now, the caller receives an empty workspace path.

This simplification may be intentional (perhaps the caller handles this case), but it means that if a worktree is accidentally deleted while the agent config still exists, the agent will no longer self-heal on resume. Verify that downstream consumers of GetAgent handle agentWorkspace == "" gracefully for git-backed agents.

4. podman unshare Failure is Silent

The fallback logs via util.Debugf but does not return an error. This matches the PR's stated design ("graceful degradation"), which is appropriate since the original code also silently logged the failure. The behavior is consistent.

5. No Timeout on exec.Command

exec.Command("podman", "unshare", "rm", "-rf", externalAgentDir).Run()

Run() blocks indefinitely. If podman unshare hangs (e.g., waiting for a user namespace lock), DeleteAgentFiles will never return. Consider using CommandContext with a timeout:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if unshareErr := exec.CommandContext(ctx, "podman", "unshare", "rm", "-rf", externalAgentDir).Run(); unshareErr != nil {
    util.Debugf("delete: podman unshare removal also failed: %v", unshareErr)
}

This would require adding context to the imports (it's already imported elsewhere in the file via function parameters but DeleteAgentFiles doesn't currently take a context).


Positive Feedback

  • The comment block explaining the podman rootless UID mapping issue is clear and provides good context for future maintainers.
  • The fallback approach (try standard removal first, then podman unshare) is the correct pattern — it avoids unnecessary subprocess invocation on Docker/rootful setups.
  • Using exec.Command with argument list (not exec.Command("sh", "-c", ...)) correctly prevents shell injection.
  • The PR description is thorough and references the analogous K8s pattern.

Final Verdict

Approve with minor suggestions. The core fix (podman unshare fallback) is correct and low-risk. The two suggestions worth considering before merge:

  1. Add a timeout to the exec.Command call to prevent indefinite hangs (observation chore(deps-dev): bump rollup from 4.56.0 to 4.59.0 in /web #5).
  2. Verify that the removed worktree recreation logic in GetAgent is truly unnecessary, or add the worktree prune back after stale directory removal (issue chore(deps-dev): bump ajv from 6.12.6 to 6.14.0 in /web #2).

Neither is a blocking issue — the PR achieves its stated goal and degrades gracefully on failure.

@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 11, 2026

@meatballs I'm happy to patch up minor found issues like this, or let you fix - your choice, just let me know

Addresses review feedback on GoogleCloudPlatform#107: exec.Command(...).Run() blocks
indefinitely if podman unshare hangs (e.g. waiting on a user
namespace lock). Use exec.CommandContext with a 30s timeout so
DeleteAgentFiles always returns.
@meatballs
Copy link
Copy Markdown
Contributor Author

Thanks for the review @ptone — pushed ac056d25 which adds a 30s context.WithTimeout to the podman unshare exec (observation #5). Prevents DeleteAgentFiles from blocking indefinitely if the user namespace lock hangs.

Re issues #2 and #3 (worktree prune in GetAgent, workspace recreation logic): I think those may have been picked up from combined-branch context rather than this PR. The PR diff only touches DeleteAgentFiles in provision.goGetAgent is untouched here. Happy to file a follow-up if you'd like the worktree-prune cleanup addressed separately.

meatballs added a commit to Empiria/scion that referenced this pull request Apr 12, 2026
Addresses review feedback on GoogleCloudPlatform#107: exec.Command(...).Run() blocks
indefinitely if podman unshare hangs (e.g. waiting on a user
namespace lock). Use exec.CommandContext with a 30s timeout so
DeleteAgentFiles always returns.
@ptone ptone merged commit ae09ce8 into GoogleCloudPlatform:main Apr 12, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent deletion fails to clean up home directory in podman rootless mode

2 participants