vpc-node-setup.sh writes "null" to shared volume on registration failure, no retry logic #29

@borisnieuwenhuis

Problem

When /api/register returns an error response (e.g. 403 Forbidden), vpc-node-setup.sh writes the string "null" to /shared/server_url, /shared/pre_auth_key, and /shared/shared_key. The
VPC client then loops forever:

Received error: fetch control key: Get "null/key?v=123": unsupported protocol scheme ""

The script also has no retry logic — it runs once and exits, so there's no recovery if the VPC server isn't ready or the app isn't in the allowlist yet.

Root Cause

Two issues in scripts/vpc-node-setup.sh:

1. jq -r prints the string "null" for missing fields. That string is non-empty, so it slips past the -z check:

PRE_AUTH_KEY=$(jq -r .pre_auth_key <<<"$RESPONSE")
# When RESPONSE is {"error":"Forbidden"}, jq -r .pre_auth_key outputs "null" (non-empty string)

if [ -z "$PRE_AUTH_KEY" ] || [ -z "$SHARED_KEY" ] || [ -z "$VPC_SERVER_URL" ]; then
    # Never reached for "null": -z is false for any non-empty string, so this error branch is skipped

2. No retry logic:

The script sends the registration request once. In orchestrated environments, the VPC server's ALLOWED_APPS may be updated after the node is deployed. A node that registers before that update gets a 403 on its only attempt and never recovers.
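
The first issue can be seen without jq at all. The sketch below is illustrative, not the script's actual code: is_missing is a hypothetical helper that treats both an empty string and the literal "null" as missing, which would catch the error response even if the jq expressions stayed unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vpc-node-setup.sh): succeeds when a parsed
# field is unusable, i.e. empty or the literal string "null".
is_missing() {
    [ -z "$1" ] || [ "$1" = "null" ]
}

# What jq -r yields for an absent field.
PRE_AUTH_KEY="null"

# The plain -z test does not flag it, because "null" is a non-empty string.
if [ -z "$PRE_AUTH_KEY" ]; then
    plain_check="missing"
else
    plain_check="present"
fi

# The stricter helper does flag it.
if is_missing "$PRE_AUTH_KEY"; then
    strict_check="missing"
else
    strict_check="present"
fi

echo "plain -z check: $plain_check"
echo "is_missing:     $strict_check"
```

The plain check reports "present" and the helper reports "missing", which is the gap the registration error falls through.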

Reproduction

  1. Deploy a VPC node whose app_id is NOT yet in VPC_ALLOWED_APPS
  2. /api/register returns {"error":"Forbidden"}
  3. jq -r .server_url outputs the string "null", which is non-empty
  4. Script writes null to /shared/server_url and exits reporting "VPC setup completed"
  5. VPC client loops forever on Get "null/key?v=123": unsupported protocol scheme ""
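
Steps 2 to 4 can be checked locally with jq alone, without a VPC server. This sketch assumes jq is installed and reuses the error body from step 2; the '// empty' form is the one proposed under Suggested Fix.

```shell
#!/usr/bin/env bash
# Error body from step 2 of the reproduction.
RESPONSE='{"error":"Forbidden"}'

# Current behaviour: a missing field becomes the four-character string "null".
raw=$(jq -r .server_url <<<"$RESPONSE")

# With the alternative operator, a missing (or null) field produces no output.
safe=$(jq -r '.server_url // empty' <<<"$RESPONSE")

printf 'raw=[%s] safe=[%s]\n' "$raw" "$safe"
# prints: raw=[null] safe=[]
```

Note that '// empty' also drops a literal false, which should not matter for these string fields.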

Suggested Fix

# Use jq '// empty' to return empty string instead of "null" for missing fields
PRE_AUTH_KEY=$(jq -r '.pre_auth_key // empty' <<<"$RESPONSE")
SHARED_KEY=$(jq -r '.shared_key // empty' <<<"$RESPONSE")
VPC_SERVER_URL=$(jq -r '.server_url // empty' <<<"$RESPONSE")

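# Add a retry loop to ride out the race condition
MAX_RETRIES=30
RETRY_INTERVAL=10
for i in $(seq 1 "$MAX_RETRIES"); do
    RESPONSE=$(curl -s ...)
    PRE_AUTH_KEY=$(jq -r '.pre_auth_key // empty' <<<"$RESPONSE")
    # ... parse other fields ...
    if [ -n "$PRE_AUTH_KEY" ] && [ -n "$SHARED_KEY" ] && [ -n "$VPC_SERVER_URL" ]; then
        break
    fi
    echo "Registration failed (attempt $i/$MAX_RETRIES): $RESPONSE" >&2
    sleep "$RETRY_INTERVAL"
done

# Fail loudly if every attempt failed, instead of writing empty values
if [ -z "$PRE_AUTH_KEY" ] || [ -z "$SHARED_KEY" ] || [ -z "$VPC_SERVER_URL" ]; then
    echo "Registration failed after $MAX_RETRIES attempts" >&2
    exit 1
fi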
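
Separately from retrying, the write step itself could refuse bad values so that a malformed response never reaches the VPC client. This is a minimal sketch under stated assumptions: write_shared and its directory argument are hypothetical and do not exist in vpc-node-setup.sh (which writes to /shared), and the http(s) check is inferred from the client error above, where the URL is used as a prefix for /key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate the three registration values, then write them.
# Usage: write_shared <dir> <server_url> <pre_auth_key> <shared_key>
write_shared() {
    local dir="$1" server_url="$2" pre_auth_key="$3" shared_key="$4"
    local v

    # Reject empty values and the literal "null" that jq -r emits.
    for v in "$server_url" "$pre_auth_key" "$shared_key"; do
        if [ -z "$v" ] || [ "$v" = "null" ]; then
            echo "refusing to write: missing registration field" >&2
            return 1
        fi
    done

    # Assumption: the client needs an http(s) URL, since it requests <server_url>/key.
    case "$server_url" in
        http://*|https://*) ;;
        *) echo "refusing to write: bad server_url: $server_url" >&2; return 1 ;;
    esac

    # Write to a temp file and rename, so a reader never sees a partial value.
    printf '%s\n' "$server_url"   >"$dir/server_url.tmp"   && mv "$dir/server_url.tmp"   "$dir/server_url"
    printf '%s\n' "$pre_auth_key" >"$dir/pre_auth_key.tmp" && mv "$dir/pre_auth_key.tmp" "$dir/pre_auth_key"
    printf '%s\n' "$shared_key"   >"$dir/shared_key.tmp"   && mv "$dir/shared_key.tmp"   "$dir/shared_key"
}

# Example run against a scratch directory with placeholder values.
demo_dir=$(mktemp -d)
write_shared "$demo_dir" "null" "null" "null" || echo "rejected, nothing written"
write_shared "$demo_dir" "https://vpc.example.invalid" "key-1" "key-2" && ls "$demo_dir"
```

With a guard like this, the failure in the reproduction would surface as a non-zero exit from the setup script instead of a client stuck on a "null" URL.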

Environment

  • dstack-vpc: main branch
  • Phala CVM with cgroup v2
  • VPC server with VPC_ALLOWED_APPS set to specific app IDs
