Add GPU/Device Support and Fix Symlink Deduplication Issues#176
Add GPU/Device Support and Fix Symlink Deduplication Issues#176akshaver wants to merge 15 commits intomintoolkit:masterfrom
Conversation
Now one can pass in the cro runtime and GPU like
--cro-device-request '{"Count":-1, \
"Capabilities":[["gpu"]]}' \
--cro-runtime nvidia \
...
Correct issues with duplicate paths from symlink processing, when they both point to the same inode. This fixes a behavior where slimtookit would randomly generate o-byte files for the actual file referenced by the symlink.
Infinite loop in getStringParam: when a syscall had a NULL path argument, PtracePeekData returned EIO with count=0. The function silenced EIO, never advanced the pointer, and had no exit condition, burning 100% CPU until exit. Fix: return early on NULL pointer and on any PtracePeekData error.
…ubdir false positive OKReturnStatus for checkFile syscalls intentionally accepts ENOENT (-2) to track Python import search paths. When Python imports a namespace package (a directory without __init__.py), it probes for several __init__.* variants -- all returning ENOENT -- before stat'ing the directory itself (success). The radix tree walk in FileActivity() then sees the directory as a prefix of these ghost child paths and marks it IsSubdir=true, excluding it from the slim image. Since the ghost children don't exist on disk, neither the directory nor any files are preserved. Add HasSuccessfulAccess to FSActivityInfo, set it only when retVal==0, and require it in the IsSubdir determination so that ENOENT-only child paths cannot cause a parent directory to be dropped.
Code Review SummaryStatus: 8 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)CRITICAL
WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments:
Files Reviewed (16 files)
|
|
thank you @akshaver for creating the PR here! |
|
auggie review |
🤖 Augment PR SummarySummary: This PR adds GPU/device support for profiling runs and fixes several correctness issues in file/symlink processing and ptrace-based monitoring. Changes:
Technical Notes: Device requests are passed as JSON matching Docker’s 🤖 Was this summary useful? React with 👍 or 👎 |
|
Will try to address these comments today or tomorrow, so the PR doesn't lose momentum. |
… 0, not retVal == 0
…d false dedup across mounts
…opping the device config
Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
|
@kcq let me know if you need anything else with the PR. |
pkg/monitor/ptrace/ptrace.go
Outdated
| fsa.OpsAll++ | ||
| fsa.Pids[e.pid] = struct{}{} | ||
| fsa.Syscalls[int(e.callNum)] = struct{}{} | ||
| if p.OKReturnStatus(e.retVal) { |
There was a problem hiding this comment.
CRITICAL: Incorrect indentation - this if statement should be indented to be inside the preceding if block
The if p.OKReturnStatus(e.retVal) on line 289 has the same indentation as the if fsa, ok := app.fsActivity[e.pathParam]; ok on line 285, but it should be indented further to be part of that block. This causes HasSuccessfulAccess to never be set for existing file activity entries.
| if p.OKReturnStatus(e.retVal) { | |
| if p.OKReturnStatus(e.retVal) { | |
| fsa.HasSuccessfulAccess = true | |
| } |
|
The recent kilo-code changes are breaking the vllm integration test. Tracking down why. |
…Status The kilo-code-bot suggested changing the HasSuccessfulAccess condition from `e.retVal == 0 || p.SyscallType() == OpenFileType` to `p.OKReturnStatus(e.retVal)`, arguing that the former incorrectly treated all OpenFileType syscalls as successful. However, the original expression was correct: - For OpenFileType (open, openat, readlink): the outer condition on line 276-278 already filters to successful calls via `p.OKReturnStatus(e.retVal)`, which requires fd >= 0. The `|| p.SyscallType() == OpenFileType` clause is redundant, not wrong -- it simply ensures HasSuccessfulAccess=true for events that already passed the success filter. - For CheckFileType (stat, access, lstat): `e.retVal == 0` correctly marks only files that actually exist on disk. The replacement `p.OKReturnStatus(e.retVal)` also accepts ENOENT (-2) and ENOTDIR (-20), causing non-existent "ghost" paths to be marked as HasSuccessfulAccess=true. This broke the namespace package fix (da8eb2c): when Python probes for __init__.cpython-311.so, __init__.so, __init__.py in a directory that has none of them (namespace package), those ENOENT stat results created ghost children with HasSuccessfulAccess=true. The FileActivity() radix tree walk then marked the parent directory as IsSubdir and excluded it from the slim image. Since the ghost children don't exist on disk, neither the directory nor its contents were preserved, causing libraries to go missing.
8c0c5e6 to
a2ad80e
Compare
|
Reverted back to pre kilo-code-bot suggestions. The kilo-code-bot suggested changing the HasSuccessfulAccess condition from Original expression was correct:
This broke the namespace package fix (da8eb2c): when Python probes for |
Resolved. kilo-code-bot was wrong. |
|
Thank you @akshaver for investigating and addressing all of those! Really appreciate it! |
Fixes-102, Fixes-306, Fixes-401, Fixes-579
What
This MR introduces four patches that add GPU support to docker-slim and fix critical symlink processing bugs:
Implement docker device request in CLI - Adds
--cro-device-requestand--cro-runtimeCLI flags to enable GPU access during container profiling. Allows passing device requests as JSON (e.g.,--cro-device-request '{"Count":-1, "Capabilities":[["gpu"]]}' --cro-runtime nvidia).Fix issues with duplicate symlink processing - Resolves a bug where files accessed through multiple symlinked paths (e.g.,
/usr/local/cuda/vs/usr/local/cuda-12.9/) would be copied multiple times, with later copies randomly overwriting with 0-byte content. Implements inode-based deduplication to ensure only one canonical copy is kept.Example use case with nvidia runtime - Adds documentation and example scripts demonstrating GPU workloads:
test_nvidia_smi.sh- Slims ubuntu to run nvidia-smitest_nvidia_pytorch.sh- Slims nvidia-pytorch to run CUDA testsNon-trivial example of slimming vllm - Adds a comprehensive example for slimming VLLM (LLM inference) containers with full API test suite validation.
Fixes PTRACE handling bugs Exits an infinite loop that occurs when syscall had a NULL path
argument. Don't drop parent paths when ENOENT-only child paths.
Why
GPU Support (Print os.Args prior to exiting. #102, #401): Many modern workloads require GPU access, especially in ML/AI contexts. Without
--cro-device-requestand--cro-runtime, it was impossible to profile containers that required GPU access during execution, making docker-slim unusable for CUDA-based images.Symlink Bug (#306, #579): The symlink deduplication bug caused slimmed images to have corrupted (0-byte) library files, particularly in NVIDIA CUDA containers where
/usr/local/cudasymlinks to versioned directories like/usr/local/cuda-12.9. This made slimmed images non-functional, as critical.sofiles would be empty.How Tested
Unit Tests: Added comprehensive test coverage for:
clifvgetter_test.go,config_test.go,device_request_test.go)dedup_test.go)ptrace_test.go)fsutil_test.go)Integration Tests:
test_nvidia_smi.sh- Verifies slimmed ubuntu image can run nvidia-smi with identical output to originaltest_nvidia_pytorch.sh- Verifies slimmed pytorch image passes CUDA teststest_nvidia_vllm.sh- Comprehensive VLLM API test suite comparing original vs slimmed image behavior (15 endpoint tests)Manual Testing: Successfully slimmed VLLM containers with GPU workloads and verified functional parity.