sudo fleetint scanPerforms a quick health scan of GPUs and system components. Returns immediately with a summary of any detected issues.
Aliases: sudo fleetint check, sudo fleetint s
sudo fleetint runStarts the API server. By default it listens on a Unix socket at /run/fleetint/fleetint.sock (access controlled by file permissions). Pass --listen-address to switch to TCP.
Options:
--log-level: Set logging level (debug, info, warn, error)--listen-address: Listen address. An absolute path (e.g./run/fleetint/fleetint.sock) creates a Unix socket; ahost:portvalue (e.g.127.0.0.1:15133) opens a TCP listener. Default:/run/fleetint/fleetint.sock. See Exposing the Agent for External Monitoring for details on exposing to Prometheus and other tools.--components: Enable/disable specific components
sudo fleetint statusDisplays the current status of the fleetint service and monitored components.
Alias: sudo fleetint st
sudo fleetint machine-infoShows detailed information about the machine:
- Hardware specifications (CPU, memory, disk)
- GPU configuration and driver version
- CUDA runtime version
- Operating system and kernel version
- Network interfaces
- System UUID and machine ID
# View current metadata
sudo fleetint metadata
# Set metadata key-value pair
sudo fleetint metadata --set-key="key" --set-value="value"Used to view or update the agent's metadata store, including remote export configuration.
sudo fleetint compactCompacts the local Fleet Intelligence state database to reduce disk usage.
Requirements:
fleetintdmust be stopped before runningcompact- the agent must not be running (no active listener on the default socket or port)
- the command needs write access to the state database, so package installs typically require
sudo
Typical workflow:
sudo systemctl stop fleetintd
sudo fleetint compact
sudo systemctl start fleetintdsudo fleetint precheckValidates the local prerequisites required for installation and enrollment.
What it checks:
- NVIDIA GPU presence
- supported GPU architecture (
Hopper,Blackwell,Rubin) - NVIDIA driver major version (
510or newer) - DCGM HostEngine reachability
- DCGM HostEngine minimum version (
4.2.3) nvattestavailability
The command prints each check result and exits non-zero if any hard requirement fails.
Environment Variables (DCGM connection):
DCGM_URL: Address of the DCGM HostEngine (default:localhost)DCGM_URL_IS_UNIX_SOCKET: Set totrueifDCGM_URLis a Unix socket path (default:false)
# Pass token directly (visible in process list)
sudo fleetint enroll --endpoint=https://api.example.com --token=<your-sak-token>
# Read token from a file (recommended)
sudo fleetint enroll --endpoint=https://api.example.com --token-file=/path/to/token
# Read token from stdin
echo "$TOKEN" | sudo fleetint enroll --endpoint=https://api.example.com --token-file=-Enrolls the agent with the Fleet Intelligence backend by exchanging a Service Account Key (SAK) token for a JWT token. The JWT token and backend endpoints are stored locally for subsequent data exports.
Required Options:
--endpoint: Base endpoint URL for the Fleet Intelligence backend (must use HTTPS)--token: Service Account Key (SAK) token for authentication (mutually exclusive with--token-file)--token-file: Path to a file containing the SAK token, or-to read from stdin (mutually exclusive with--token). Preferred over--tokenbecause it avoids exposing the token in/proc/<pid>/cmdline.
One of --token or --token-file is required.
Optional Flags:
--force: Continue enrollment even iffleetint precheckfails
What it does:
- Runs the same prerequisite validation as
fleetint precheck - Validates the endpoint URL (must be HTTPS)
- Makes an enrollment request to exchange the SAK token for a JWT token
- Stores the JWT token and backend endpoints (metrics, logs, nonce) in the local metadata database
- The stored credentials are used automatically by the agent for data export
Example output:
Enrollment succeeded
Error handling:
- precheck failure: Enrollment is blocked unless
--forceis set - 400: Token format is incorrect
- 401: Token is invalid
- 403: Token is expired or revoked
- 404: Endpoint not found
- 429: Server is rate limiting (retry later)
- 502/503/504: Temporary server issues (retry)
sudo fleetint unenrollRemoves all enrollment credentials and backend endpoints from the agent. After unenrolling, the agent will no longer export data to the backend until re-enrolled.
What it does:
- Clears the JWT token from local storage
- Clears the SAK token from local storage
- Removes all stored backend endpoints (enroll, metrics, logs, nonce)
Use this command when:
- Decommissioning a machine
- Switching to a different backend
- Troubleshooting authentication issues
For environments without network access or when you need to collect data to files:
sudo fleetint run --offline-mode --path=/tmp/fleetint --duration=00:05:00 --format=csvOptions:
--offline-mode: Disable HTTP API server and export to files--path: Absolute path to the output directory. Must not be inside restricted system directories (/etc,/usr,/sys,/bin,/boot,/dev,/lib,/proc,/run,/sbin).--duration: How long to collect data (format: HH:MM:SS)--format: Output format (csvorjson)
After package installation, the agent runs as a systemd service:
# Check service status
sudo systemctl status fleetintd
# Start/stop/restart service
sudo systemctl start fleetintd
sudo systemctl stop fleetintd
sudo systemctl restart fleetintd
# View logs
sudo journalctl -u fleetintd -fThe fleetint API server listens on a Unix socket (/run/fleetint/fleetint.sock) by default. When started with a TCP address (e.g. --listen-address=127.0.0.1:15133), the REST endpoints are also available over plain HTTP.
Using curl with the default Unix socket (requires sudo since the socket is owner-only):
sudo curl --unix-socket /run/fleetint/fleetint.sock http://localhost/healthzUsing curl with TCP (requires --listen-address=127.0.0.1:15133):
curl http://localhost:15133/healthzThe examples below use the TCP form for brevity. For the default socket, prefix with sudo and substitute --unix-socket /run/fleetint/fleetint.sock http://localhost for the hostname.
curl http://localhost:15133/healthzReturns the health status of the API server
Response:
{
"status": "ok",
"version": "v1"
}curl http://localhost:15133/machine-infoReturns basic machine info
Note: Detailed hardware and GPU information is available via the fleetint machine-info CLI command.
curl http://localhost:15133/v1/statesReturns the current health states of all monitored components
curl http://localhost:15133/v1/metricsReturns metrics data in JSON format from all monitored components
Query Parameters:
startTime: Unix timestamp to retrieve metrics since a specific timecomponents: Filter metrics by component name
Example:
# Get metrics from the last hour
curl "http://localhost:15133/v1/metrics?startTime=$(date -d '1 hour ago' +%s)"
# Get metrics for specific component
curl "http://localhost:15133/v1/metrics?components=accelerator-nvidia-temperature"curl http://localhost:15133/v1/eventsReturns event data from all monitored components (errors, warnings, state changes)
Query Parameters:
since: Unix timestamp to retrieve events since a specific time (default: last hour)components: Filter events by component name
Example:
# Get events from the last hour
curl "http://localhost:15133/v1/events?since=$(date -d '1 hour ago' +%s)"
# Get events for specific component
curl "http://localhost:15133/v1/events?components=accelerator-nvidia-error-xid"curl http://localhost:15133/metricsReturns metrics in Prometheus exposition format for integration with monitoring systems
By default, fleetint uses a Unix socket for security. To allow external monitoring tools like Prometheus to scrape metrics over the network, switch to a TCP listener with the --listen-address flag:
# Expose on all interfaces
sudo fleetint run --listen-address=0.0.0.0:15133
# Or expose on a specific IP address
sudo fleetint run --listen-address=192.168.1.100:15133Configure Prometheus to scrape the exposed endpoint:
# prometheus.yml
scrape_configs:
- job_name: 'fleetint'
scrape_interval: 60s
static_configs:
- targets:
- 'gpu-server-1:15133'
- 'gpu-server-2:15133'
metrics_path: /metricsFor production deployments with persistent configuration and security considerations, see the Configuration Guide.
-
Check service status:
sudo systemctl status fleetintd
-
View recent logs:
sudo journalctl -u fleetintd -n 50
-
Verify NVIDIA drivers are installed (if using GPUs):
nvidia-smi
-
Check that the daemon is listening:
# Default (unix socket) sudo ls -la /run/fleetint/fleetint.sock # TCP mode sudo netstat -tlnp | grep 15133
-
Check the logs:
sudo journalctl -u fleetintd -f
-
Verify your configuration:
sudo fleetint metadata
-
Test connectivity to the export endpoint manually with
curl -
Check proxy settings in
/etc/default/fleetintif behind a firewall
The agent should use <100MB RAM and <1% CPU. Higher usage might indicate:
- Very frequent collection intervals (check
FLEETINT_COLLECT_INTERVAL) - Large lookback windows (check
FLEETINT_METRICS_LOOKBACKandFLEETINT_EVENTS_LOOKBACK) - Many GPUs in the system (resource usage scales with GPU count)
- Debug logging enabled (use
--log-level=warnorerror)