This repository contains customer-facing support scripts and binaries for collecting diagnostics or applying narrow workarounds on customer VMs. Assume the operator is a customer or support engineer running the tool locally on a machine we do not control and cannot access directly.
- First-class projects live under
customers/<name>/. Each project has its ownAGENTS.mdwith verification commands, stack-specific rules, and links toCODEMAP.md/ARCHITECTURE.mdas needed. - When changing code under a project directory, follow that project's
AGENTS.mdfor how to build, test, and review. - New projects may use any stack (Go, Node, Python, shell, etc.). The root file stays policy and discovery—not a catalog of every toolchain's commands.
customers/vm-troubleshooting/(gather-info) produces diagnostic archives (manifest, report stream, triage data, schemas).customers/vm-troubleshooting-dashboard/consumes those archives for ingest and UI.- Authoritative compatibility and versioning rules:
customers/vm-troubleshooting/SCHEMA-COMPATIBILITY.md. Producer or consumer changes that affect archive shape, schema majors, or mirrored JSON files usually belong in a coordinated change (both sides + that doc when applicable). - Dashboard
schemas/mirrors collectorschemas/; keep them aligned per each project'sAGENTS.mdchecklists.
- Prefer self-contained tooling with minimal external dependencies.
- Assume common Linux distributions first: Ubuntu 20.04/22.04/24.04. Treat other distros as best-effort unless explicitly supported.
- Tools must continue to provide useful output when some commands, drivers, services, or files are missing.
- Never assume outbound internet access.
- Never assume package managers are usable.
- Never assume GPUs, kernel modules, systemd, sudo, or container runtimes are present.
- Do not collect secrets or high-risk data.
- Never collect private keys, tokens, passwords, shell history, environment variables, cloud credentials, SSH material, or application secrets.
- Prefer explicit allowlists of collected files/commands over broad filesystem collection.
- Redact sensitive values if collection is unavoidable.
- Any remediation script must be narrowly scoped, reversible where possible, and clearly logged.
- Output must be understandable by a customer and useful to support.
- Print what is being collected and why when practical.
- Fail per section, not all-or-nothing.
- Exit codes must be meaningful.
- Write deterministic output paths and filenames.
- Prefer a single archive or directory that the customer can send back.
- Prefer read-only diagnostics.
- For any mutating action, require explicit user intent and document risk.
- Use timeouts for subprocesses and slow probes.
- Handle missing commands gracefully.
- Avoid distro-specific parsing when a portable interface exists.
- Avoid brittle text parsing of commands whose output changes across versions.
- Prefer machine-readable sources when available.
- Do not silently ignore errors that affect support value; record them in output.
Before considering work complete, run the narrowest relevant checks for what you changed.
Root-level assets only (e.g. scripts in the repo root):
- Bash lint:
shellcheck nvidia-drm-disable-modeset.sh - Bash syntax:
bash -n nvidia-drm-disable-modeset.sh
Anything under customers/: use that project's AGENTS.md verification section (Go, frontend, etc.).
customers/: shipped or support-facing tools and assets; each subfolder is a project with its ownAGENTS.md.- Root scripts: focused one-off support or remediation utilities (verify with root-only commands above).
customers/vm-troubleshooting/: diagnostics collector (gather-info). Maps:CODEMAP.md,ARCHITECTURE.md.customers/vm-troubleshooting-dashboard/: dashboard (Go API + UI). Map:CODEMAP.md.docs/: planning notes and indexes (docs/README.md,docs/architecture.mdpoint to project-local maps).
When architecture or boundaries change materially, update the relevant project orientation docs in the same change:
- Collector:
customers/vm-troubleshooting/CODEMAP.md;ARCHITECTURE.mdwhen pipeline/types/modes narratives change. - Dashboard:
customers/vm-troubleshooting-dashboard/CODEMAP.mdwhen package layout or ingest/API flow changes.
For Go-based diagnostics behavior (modular collectors, timeouts, static builds, graceful skips), follow the collector project's AGENTS.md; do not duplicate those rules here.
If you add continuous integration, path-scoped jobs (build/test only projects whose paths changed) are a practical way to keep feedback fast as customers/ grows. This is optional operational practice, not a requirement of this repository.
A change is not done until:
- the relevant checks pass,
- the tool behavior is explained in customer-safe terms,
- output remains useful on partially broken systems,
- and no new sensitive data is collected.