Brev Cloud Provider Integration Guide

For Cloud Infrastructure Providers Integrating with Brev


Table of Contents

  1. Integration Overview
  2. How Brev Discovers Your Inventory
  3. Instance Types: Your Compute Catalog
  4. Location Model
  5. GPU Normalization
  6. Credential and Authentication Model
  7. Instance Lifecycle Operations
  8. SSH Connectivity
  9. Firewall and Security Groups
  10. Instance Metadata and Tags
  11. Error Handling and Status Reporting
  12. Billing and Pricing
  13. Common Questions

1. Integration Overview

What Does Integration Mean?

When you integrate with Brev, you're allowing Brev's control plane to:

  1. Sync your available GPU instance types into Brev's catalog
  2. Provision instances on your infrastructure via API calls
  3. Manage instance lifecycle (start, stop, terminate) through your API
  4. Connect to running instances via SSH to configure them

What Brev Needs From You (Cloud Provider)

| Requirement | Purpose |
|---|---|
| Instance Type Listing API | Discover your available instance types |
| Instance Lifecycle APIs | Create, get, start, stop, terminate |
| API Credentials for Brev | Authenticate Brev's calls to your API |
| SSH Key Injection | Accept SSH public key at VM creation |
| SSH Access | Control plane communication to VMs |

Integration Architecture

System Architecture Diagram

┌────────────────────────────────────────────────────────────────────────────────────┐
│                              Brev Control Plane (dev-plane)                        │
│                                                                                    │
│  ┌──────────────────────────────────┐    ┌──────────────────────────────────────┐  │
│  │         Syncer Layer             │    │     Instance Service Layer           │  │
│  │    (Continuous Reconciliation)   │    │       (User-Triggered Actions)       │  │
│  │                                  │    │                                      │  │
│  │  ┌────────────────────────────┐  │    │  ┌────────────────────────────────┐  │  │
│  │  │  InstanceTypeSyncer        │  │    │  │  Instance Lifecycle            │  │  │
│  │  │  ─────────────────────     │  │    │  │  ─────────────────────         │  │  │
│  │  │  Calls:                    │  │    │  │  Calls:                        │  │  │
│  │  │  • GetInstanceTypes()      │  │    │  │  • CreateInstance()            │  │  │
│  │  │  • GetLocations()          │  │    │  │  • TerminateInstance()         │  │  │
│  │  │  • GetInstanceTypePollTime │  │    │  │  • StopInstance()              │  │  │
│  │  │                            │  │    │  │  • StartInstance()             │  │  │
│  │  │  Interval: 1-5 min         │  │    │  │                                │  │  │
│  │  └────────────┬───────────────┘  │    │  └──────────────┬─────────────────┘  │  │
│  │               │                  │    │                 │                    │  │
│  │  ┌────────────┴───────────────┐  │    │  ┌──────────────┴─────────────────┐  │  │
│  │  │  InstanceSyncer            │  │    │  │  Instance State & Queries      │  │  │
│  │  │  ─────────────────────     │  │    │  │  ─────────────────────         │  │  │
│  │  │  Calls:                    │  │    │  │  Calls:                        │  │  │
│  │  │  • ListInstances()         │  │    │  │  • GetInstance()               │  │  │
│  │  │                            │  │    │  │  • ListInstances()             │  │  │
│  │  │  Interval: 5 sec           │  │    │  │  • AddFirewallRulesToInstance  │  │  │
│  │  └────────────┬───────────────┘  │    │  │  • ResizeInstanceVolume()      │  │  │
│  │               │                  │    │  │  • UpdateInstanceTags()        │  │  │
│  └───────────────┼──────────────────┘    │  └──────────────┬─────────────────┘  │  │
│                  │                       └─────────────────┼────────────────────┘  │
│                  │                                         │                       │
└──────────────────┼─────────────────────────────────────────┼───────────────────────┘
                   │                                         │
                   │       ┌─────────────────────────────────┘
                   │       │
                   ▼       ▼
┌────────────────────────────────────────────────────────────────────────────────────┐
│                           CLOUD SDK (v1) - This Repo                               │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                         CloudClient Interface                                │  │
│  │  (Composed of: CloudCredential, CloudBase, CloudQuota, CloudStopStart,       │  │
│  │   CloudReboot, CloudResizeVolume, CloudModifyFirewall, CloudInstanceTags...) │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                        Provider Implementations                              │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐            │  │
│  │  │ Lambda   │ │ Fluidstk │ │ Shadefrm │ │  Nebius  │ │  Your    │            │  │
│  │  │ Labs     │ │          │ │          │ │          │ │ Provider │   • • •    │  │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘            │  │
│  └───────┼────────────┼────────────┼────────────┼────────────┼──────────────────┘  │
│          │            │            │            │            │                     │
└──────────┼────────────┼────────────┼────────────┼────────────┼─────────────────────┘
           │            │            │            │            │
           ▼            ▼            ▼            ▼            ▼
┌────────────────────────────────────────────────────────────────────────────────────┐
│                            CLOUD PROVIDER APIs                                     │
│                                                                                    │
│  Each provider's native REST/gRPC API for instance management                      │
└────────────────────────────────────────────────────────────────────────────────────┘

2. How Brev Discovers Your Inventory

The Instance Type Syncer

Brev runs a continuous synchronization process that periodically queries your API to understand what compute is available.

Sync Behavior:

  • Polls your instance type listing API at a configurable interval you define via GetInstanceTypePollTime() (default: 1 minute; existing implementations use 1-5 minutes depending on provider needs)
  • Compares current catalog to previous state
  • Updates availability, pricing, and specs as they change
  • Marks types as unavailable when removed from your API
  • Adds new types when they appear
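The sync behavior above can be sketched as a merge of the latest listing into the previous catalog. This is only an illustration: `CatalogEntry` and `Reconcile` are hypothetical names, not part of the SDK, and the real syncer may track more state than shown here.

```go
package main

import "fmt"

// CatalogEntry is a hypothetical stand-in for a synced instance type;
// it is not the SDK's v1.InstanceType.
type CatalogEntry struct {
	ID          string
	IsAvailable bool
}

// Reconcile merges the latest provider listing into the previous catalog.
// Entries missing from the latest listing are kept but marked unavailable,
// new entries are added, and existing entries take the latest values.
func Reconcile(previous, latest map[string]CatalogEntry) map[string]CatalogEntry {
	next := make(map[string]CatalogEntry, len(previous)+len(latest))
	for id, entry := range previous {
		if _, stillListed := latest[id]; !stillListed {
			entry.IsAvailable = false
			next[id] = entry
		}
	}
	for id, entry := range latest {
		next[id] = entry
	}
	return next
}

func main() {
	previous := map[string]CatalogEntry{
		"a": {ID: "a", IsAvailable: true},
		"b": {ID: "b", IsAvailable: true},
	}
	latest := map[string]CatalogEntry{
		"b": {ID: "b", IsAvailable: false},
		"c": {ID: "c", IsAvailable: true},
	}
	next := Reconcile(previous, latest)
	fmt.Println(len(next), next["a"].IsAvailable, next["b"].IsAvailable, next["c"].IsAvailable)
}
```

Keeping removed types in the catalog as unavailable, rather than deleting them, matches the "marks types as unavailable" behavior described in the list.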

What We Query

We need an API endpoint that returns your available instance types. For each type, we map your data to the v1.InstanceType struct (defined in cloud/v1/instancetype.go):

Core Instance Type Fields:

| Struct Field | Type | Description | Example |
|---|---|---|---|
| Type | string | Your internal type name | "gpu_1x_a100_80gb_sxm4" |
| Location | string | Region identifier | "us-west-1" |
| VCPU | int32 | vCPU count | 128 |
| MemoryBytes | Bytes | RAM (use v1.NewBytes()) | v1.NewBytes(1024, v1.Gibibyte) |
| BasePrice | *currency.Amount | Hourly price in USD | currency.NewAmountFromInt64(3200, "USD") (= $32.00/hr) |
| IsAvailable | bool | Currently launchable | true |

GPU Details (SupportedGPUs []GPU):

Struct Field Type Description Example
Count int32 Number of GPUs 8
Name string GPU model name "A100"
MemoryBytes Bytes VRAM per GPU v1.NewBytes(80, v1.Gibibyte)
NetworkDetails string Interconnect type "SXM4", "PCIe"
Manufacturer Manufacturer GPU vendor v1.ManufacturerNVIDIA

Storage Details (SupportedStorage []Storage):

Struct Field Type Description Example
SizeBytes Bytes Disk size v1.NewBytes(2000, v1.Gibibyte)
Type string Storage type "ssd", "nvme"
PricePerGBHr *currency.Amount Additional storage cost nil (if included in base price)

Example: Converting Provider Data to v1.InstanceType

From Lambda Labs implementation (cloud/v1/providers/lambdalabs/instancetype.go):

it := v1.InstanceType{
    Location:      location,
    Type:          instType.Name,                                           // "gpu_1x_a100_80gb_sxm4"
    SupportedGPUs: []v1.GPU{{
        Count:          1, // matches the "1x" in the type name above
        Name:           "A100",
        MemoryBytes:    v1.NewBytes(80, v1.Gibibyte),
        NetworkDetails: "SXM4",
        Manufacturer:   v1.ManufacturerNVIDIA,
    }},
    SupportedStorage: []v1.Storage{{
        Type:      "ssd",
        SizeBytes: v1.NewBytes(instType.Specs.StorageGib, v1.Gibibyte),
    }},
    VCPU:        instType.Specs.Vcpus,
    MemoryBytes: v1.NewBytes(instType.Specs.MemoryGib, v1.Gibibyte),
    BasePrice:   &amount,
    IsAvailable: isAvailable,
    Provider:    CloudProviderID,
    Cloud:       CloudProviderID,
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // Generate ID using helper (or set your own)

API Type Declaration

When implementing the Cloud SDK, you declare how Brev's control plane should query your integration via GetAPIType():

| API Type | Meaning | Control Plane Behavior |
|---|---|---|
| APITypeGlobal | Your GetInstanceTypes() returns all regions in one call | Brev calls once with locations = ["all"] |
| APITypeLocational | Your GetInstanceTypes() is region-scoped | Brev iterates over GetLocations() results |

You handle the mapping internally. The SDK doesn't call your API directly—your implementation does. Whether your cloud's native API is regional, global, or something else entirely, you write the conversion logic in GetInstanceTypes().
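To make the two behaviors concrete, here is a minimal sketch of how the calls might be planned for each API type. `APIType`, its constant values, and `PlanQueries` are local stand-ins written for this example; the control plane's actual dispatch code is not shown in this guide.

```go
package main

import "fmt"

// APIType and its values are local stand-ins mirroring the names in the text.
// The string values are assumptions for this sketch.
type APIType string

const (
	APITypeGlobal     APIType = "global"
	APITypeLocational APIType = "locational"
)

// PlanQueries returns the location arguments for each GetInstanceTypes call
// that would be made under the behavior described in the table above.
func PlanQueries(apiType APIType, locations []string) [][]string {
	if apiType == APITypeGlobal {
		// One call covering every region.
		return [][]string{{"all"}}
	}
	// One call per location returned by GetLocations().
	calls := make([][]string, 0, len(locations))
	for _, loc := range locations {
		calls = append(calls, []string{loc})
	}
	return calls
}

func main() {
	regions := []string{"us-west-1", "us-east-1"}
	fmt.Println(PlanQueries(APITypeGlobal, regions))
	fmt.Println(PlanQueries(APITypeLocational, regions))
}
```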

Example: Global API (Lambda Labs)

Lambda Labs' API returns all instance types with regional availability embedded. The SDK implementation fetches once and expands to per-region v1.InstanceType entries:

// Simplified from cloud/v1/providers/lambdalabs/instancetype.go
func (c *LambdaLabsClient) GetInstanceTypes(ctx context.Context, args v1.GetInstanceTypeArgs) ([]v1.InstanceType, error) {
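    resp, err := c.client.InstanceTypes(ctx) // Single API call returns all types
    if err != nil {
        return nil, err
    }

    // instanceTypes accumulates results; locations is the provider's region
    // list, whose lookup is elided in this simplified excerpt.
    var instanceTypes []v1.InstanceType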
    // Expand each type to all its available regions
    for _, instType := range resp.Data {
        for _, region := range locations {
            isAvailable := slices.Contains(instType.RegionsWithCapacityAvailable, region.Name)
            instanceTypes = append(instanceTypes, convertToV1(region.Name, instType, isAvailable))
        }
    }
    return instanceTypes, nil
}

Example: Locational API (Nebius)

Nebius requires per-region quota checks. The SDK implementation iterates regions internally:

// Simplified from cloud/v1/providers/nebius/instancetype.go
func (c *NebiusClient) GetInstanceTypes(ctx context.Context, args v1.GetInstanceTypeArgs) ([]v1.InstanceType, error) {
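    platforms, err := c.sdk.Compute().Platform().List(ctx, c.projectID)
    if err != nil {
        return nil, err
    }

    // Elided in this simplified excerpt: the loop that selects each platform
    // from platforms, and the construction of locations and quotaMap.
    var instanceTypes []v1.InstanceType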
    for _, location := range locations {
        // Check quota per-region
        isAvailable := c.checkQuotaAvailability(platform, location.Name, quotaMap)
        instanceTypes = append(instanceTypes, convertToV1(location.Name, platform, isAvailable))
    }
    return instanceTypes, nil
}

Key point: You decide how to call your cloud's API. Brev only cares that GetInstanceTypes() returns properly formatted v1.InstanceType entries with accurate Location and IsAvailable fields.


3. Instance Types: Your Compute Catalog

What Is an Instance Type to Brev?

Brev treats compute as inventory. Each instance type represents a distinct compute configuration in your catalog. Users browse your instance types filtered by GPU, region, price, and availability.

The Canonical Instance Type Model

When we ingest your instance types, we normalize them to the v1.InstanceType struct. Here are the key fields (see cloud/v1/instancetype.go for the complete definition):

Field Type Description
ID InstanceTypeID Stable, unique identifier (you define the format—see below)
Cloud string Your cloud identifier (e.g., "lambdalabs", "crusoe")
Provider string Provider identifier (often same as Cloud)
Type string Your native type name
Location string Primary region identifier
SubLocation string Availability zone (optional; helper uses "noSub" if empty)
AvailableAzs []string All zones where this type is available
SupportedGPUs []GPU GPU details (see GPU struct below)
VCPU int32 vCPU count
MemoryBytes Bytes RAM (use v1.NewBytes() helper)
SupportedStorage []Storage Storage options (see Storage struct)
BasePrice *currency.Amount Hourly price in USD
IsAvailable bool Currently launchable
Stoppable bool Can instances be stopped/resumed
Rebootable bool Can instances be rebooted

The GPU struct (cloud/v1/instancetype.go):

Field Type Description
Count int32 Number of GPUs
Name string GPU model name (e.g., "A100", "H100")
Type string Full GPU type (e.g., "A100.SXM4")
MemoryBytes Bytes VRAM per GPU
MemoryDetails string Memory type: "HBM", "GDDR", etc.
NetworkDetails string Interconnect: "PCIe", "SXM4", "SXM5"
Manufacturer Manufacturer ManufacturerNVIDIA, ManufacturerIntel, etc.

The Storage struct (cloud/v1/storage.go):

Field Type Description
Count int32 Number of disks
SizeBytes Bytes Disk size
Type string Storage type (e.g., "ssd", "nvme")
PricePerGBHr *currency.Amount Additional storage cost (if applicable)
IsEphemeral bool Lost on stop/terminate

Instance Type ID

The ID field must be a stable, unique identifier for each instance type across all regions. You control the format.

Requirements:

  • Stable: The same instance type must return the same ID on every sync
  • Unique: No two instance types can share an ID
  • Deterministic: IDs must not change between API calls

Option 1: Use the Helper Function

The SDK provides MakeGenericInstanceTypeID() which generates IDs using this pattern:

{location}-{subLocation}-{type}

If your instance type has no sublocation, the helper uses "noSub" as a placeholder.

// Set all fields first, then call the helper at the END
it := v1.InstanceType{
    Location: "us-west-1",
    Type:     "gpu_1x_a100",
    // ... other fields
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // Result: "us-west-1-noSub-gpu_1x_a100"
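For reference, the documented pattern can be reproduced in a few lines. `makeGenericID` below is a hypothetical re-implementation written only to illustrate the `{location}-{subLocation}-{type}` format and the "noSub" placeholder; the SDK helper remains the source of truth.

```go
package main

import "fmt"

// makeGenericID is a hypothetical re-implementation of the documented pattern
// {location}-{subLocation}-{type}; it is not the SDK's helper itself.
func makeGenericID(location, subLocation, instanceType string) string {
	if subLocation == "" {
		subLocation = "noSub" // placeholder described in the text
	}
	return fmt.Sprintf("%s-%s-%s", location, subLocation, instanceType)
}

func main() {
	fmt.Println(makeGenericID("us-west-1", "", "gpu_1x_a100"))
	fmt.Println(makeGenericID("us-east-1", "us-east-1a", "gpu-h100-8x"))
}
```

Because the sublocation is part of the ID, two entries that differ only in zone produce different IDs.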

Option 2: Define Your Own Format

If you prefer a different ID format, set ID directly:

// Shadeform uses: {cloud}_{instanceType}_{region}
it := v1.InstanceType{
    ID:       v1.InstanceTypeID("massedcompute_L40_desmoines-usa-1"),
    Location: "desmoines-usa-1",
    Type:     "massedcompute_L40",
    // ... other fields
}

Why Stability Matters:

Brev uses this ID to track inventory and match provisioning requests. If your IDs change between syncs, Brev loses the ability to correlate instance types correctly.

CRITICAL: ID Consistency Between InstanceType and Instance

Warning: This is the most common cause of integration failures. Instance types may sync successfully but instances fail to provision or appear "orphaned."

When Brev provisions an instance, it looks up the corresponding instance type using the instance's InstanceTypeID. These IDs must match exactly.

A Common Problem:

The SDK has two helper functions that generate IDs differently:

| Function | Used For | SubLocation Source |
|---|---|---|
| MakeGenericInstanceTypeID() | InstanceType structs | AvailableAzs[0] (first AZ) |
| MakeGenericInstanceTypeIDFromInstance() | Instance structs | SubLocation field |

If AvailableAzs[0] and SubLocation don't match, the IDs diverge and lookup fails.

The Mistakes:

// WRONG - Manually setting InstanceTypeID
inst := &v1.Instance{
    InstanceType:   "gpu-h100-8x",
    InstanceTypeID: v1.InstanceTypeID("gpu-h100-8x"),  // BUG: Missing location!
}

// WRONG - Inconsistent SubLocation vs AvailableAzs
instanceType := v1.InstanceType{
    Location:     "us-east-1",
    SubLocation:  "us-east-1a",      // Set to "us-east-1a"
    AvailableAzs: []string{"us-east-1b"},  // But AZs has "us-east-1b"!
}

The Fix:

  1. For InstanceType: Set all fields first, then call MakeGenericInstanceTypeID() at the END
  2. For Instance: Set all fields first, then call MakeGenericInstanceTypeIDFromInstance() at the END
  3. Ensure consistency: If you set both SubLocation and AvailableAzs, make sure SubLocation == AvailableAzs[0]

// CORRECT - InstanceType
it := v1.InstanceType{
    Location:     "us-east-1",
    AvailableAzs: []string{"us-east-1a"},
    Type:         "gpu-h100-8x",
    // ... other fields
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // LAST

// CORRECT - Instance
inst := &v1.Instance{
    Location:     "us-east-1",
    SubLocation:  "us-east-1a",  // Matches the AZ
    InstanceType: "gpu-h100-8x",
    // ... other fields
}
inst.InstanceTypeID = v1.MakeGenericInstanceTypeIDFromInstance(*inst)  // LAST

Symptoms of ID Mismatch:

  • Instance types sync successfully but don't appear in the Brev catalog
  • CreateInstance succeeds but subsequent operations fail
  • "instance type not found" errors during provisioning
  • Instances appear "orphaned" (no associated instance type)
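One way to surface these symptoms early is to compare the instance type IDs reported on instances against the IDs in the synced catalog. The sketch below is an assumption-laden illustration: `Instance` and `FindOrphans` are hypothetical names that carry only the fields needed for the check.

```go
package main

import "fmt"

// Instance is a hypothetical stand-in carrying only the fields used here.
type Instance struct {
	CloudID        string
	InstanceTypeID string
}

// FindOrphans returns the CloudIDs of instances whose InstanceTypeID does not
// exactly match any ID in the synced catalog.
func FindOrphans(catalogIDs []string, instances []Instance) []string {
	known := make(map[string]struct{}, len(catalogIDs))
	for _, id := range catalogIDs {
		known[id] = struct{}{}
	}
	var orphans []string
	for _, inst := range instances {
		if _, ok := known[inst.InstanceTypeID]; !ok {
			orphans = append(orphans, inst.CloudID)
		}
	}
	return orphans
}

func main() {
	catalog := []string{"us-east-1-us-east-1a-gpu-h100-8x"}
	instances := []Instance{
		{CloudID: "i-1", InstanceTypeID: "us-east-1-us-east-1a-gpu-h100-8x"},
		{CloudID: "i-2", InstanceTypeID: "gpu-h100-8x"}, // missing location and zone
	}
	fmt.Println(FindOrphans(catalog, instances))
}
```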

Validating Your Instance Type IDs

The SDK provides validation functions to catch ID generation issues early. Run these in your test suite:

1. ValidateStableInstanceTypeIDs - Ensures your instance type IDs are stable and unique:

// In your validation tests
err := v1.ValidateStableInstanceTypeIDs(ctx, client, stableIDs)
require.NoError(t, err, "ValidateStableInstanceTypeIDs should pass")

This validates:

  • Each instance type ID is unique (no duplicates)
  • Your designated stable IDs exist in the current instance types
  • All instance types have required properties (base price, storage pricing)

2. ValidateCreateInstance - Validates that instance and instance type IDs match:

// In your validation tests
instance, err := v1.ValidateCreateInstance(ctx, client, attrs, selectedType)
require.NoError(t, err, "ValidateCreateInstance should pass")

This validates (among other things):

  • instance.InstanceTypeID == selectedType.ID (catches ID generation mismatches)
  • instance.RefID matches the provided RefID
  • Location and instance type fields are consistent

Why this matters: If MakeGenericInstanceTypeID() and MakeGenericInstanceTypeIDFromInstance() produce different IDs for the same logical type, the control plane cannot correlate instances with their types. ValidateCreateInstance catches this.

See internal/validation/suite.go for the full validation test suite you can use as a reference.

4. Location Model

The Location Hierarchy

Brev describes where compute resources exist using three location fields:

| Level | Field | Description | Example |
|---|---|---|---|
| Region | Location | Primary geographic region | "us-west-1", "europe-west4" |
| Availability Zone | SubLocation | Specific zone within a region | "us-west-1a", "europe-west4-b" |
| Available Zones | AvailableAzs | All zones where this type can launch | ["us-west-1a", "us-west-1b"] |

Note: The distinction between these fields can be confusing. Location is the region, SubLocation is a specific zone (used for instances), and AvailableAzs lists all zones where an instance type is available (used for instance types).

The Location Struct

When implementing GetLocations(), you return a list of Location structs (defined in cloud/v1/location.go):

Field Type Description
Name string Region identifier (acts as the ID)
Description string Human-readable name
Available bool Whether the region is currently operational
Endpoint string API endpoint for this region (if applicable)
Priority int Preference order for region selection
Country string ISO 3166-1 alpha-3 country code

Availability on Instance Types

Availability is tracked per instance type using two fields on the InstanceType struct:

Field Type Meaning
IsAvailable bool Whether this type can currently be launched
AvailableAzs []string Which availability zones have capacity

Interpreting Availability:

  • IsAvailable: true + AvailableAzs: ["us-west-1a", "us-west-1b"] = Can launch in either AZ
  • IsAvailable: false = Type exists but is currently out of stock or disabled
  • Empty AvailableAzs with IsAvailable: true = Region-level availability only (no AZ granularity)
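The three cases above can be expressed as a small decision function. This is a sketch of one possible reading, not SDK code: `LaunchTargets` is a hypothetical helper, and returning the region itself for the "no AZ granularity" case is an assumption made for illustration.

```go
package main

import "fmt"

// LaunchTargets returns where a type can launch according to the rules above:
// nil when unavailable, the listed zones when present, otherwise the region.
func LaunchTargets(region string, isAvailable bool, availableAzs []string) []string {
	if !isAvailable {
		return nil // out of stock or disabled
	}
	if len(availableAzs) == 0 {
		return []string{region} // region-level availability only
	}
	return availableAzs
}

func main() {
	fmt.Println(LaunchTargets("us-west-1", true, []string{"us-west-1a", "us-west-1b"}))
	fmt.Println(LaunchTargets("us-west-1", false, []string{"us-west-1a"}))
	fmt.Println(LaunchTargets("us-west-1", true, nil))
}
```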

5. GPU Normalization

The GPU Struct

The Cloud SDK represents GPUs with these fields:

type GPU struct {
    Name           string           // Base model: "H100", "A100", "L40S"
    Count          int32            // Number of GPUs
    Memory         units.Base2Bytes // VRAM per GPU (deprecated, use MemoryBytes)
    MemoryBytes    Bytes            // VRAM per GPU in structured format
    MemoryDetails  string           // Memory type: "HBM2", "HBM3", "HBM2e", "GDDR"
    NetworkDetails string           // Form factor: "PCIe", "SXM", "SXM4", "SXM5"
    Manufacturer   Manufacturer     // "NVIDIA", "AMD", "Intel"
    Type           string           // Optional: original type identifier
}

Implementer Responsibility

You are responsible for normalizing GPU data. Brev does not automatically parse GPU descriptions. Your GetInstanceTypes must populate the GPU struct.

Field Example Notes
Name "H100", "A100" Base model, uppercase
Count 8 GPUs per instance
MemoryBytes v1.NewBytes(80, v1.Gibibyte) VRAM per GPU
NetworkDetails "SXM4", "PCIe" Form factor
Manufacturer "NVIDIA"

Provider Examples

Lambda Labs (cloud/v1/providers/lambdalabs/instancetype.go:parseGPUFromDescription)

Parses "8x A100 (40 GB SXM4)" using regex:

gpu.Count = int32(count)           // from (\d+)x
gpu.Name = nameStr                 // from x (.*?) \(
gpu.MemoryBytes = v1.NewBytes(v1.BytesValue(memoryGiB), v1.Gibibyte)
gpu.NetworkDetails = networkDetails // remainder after "GB"
gpu.Manufacturer = "NVIDIA"
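A standalone version of this kind of parsing might look like the sketch below. The regular expression, the `ParsedGPU` type, and `ParseGPUDescription` are assumptions written for this example and are not copied from the Lambda Labs provider; the memory figure is treated as GiB, as in the snippet above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// ParsedGPU is a hypothetical stand-in for the v1.GPU fields populated here.
type ParsedGPU struct {
	Count          int32
	Name           string
	MemoryGiB      int64
	NetworkDetails string
}

// gpuDescription matches strings such as "8x A100 (40 GB SXM4)".
// The pattern is an assumption for illustration, not the provider's regex.
var gpuDescription = regexp.MustCompile(`^(\d+)x (.*?) \((\d+) GB\s*(.*)\)$`)

// ParseGPUDescription extracts count, model, memory, and interconnect.
func ParseGPUDescription(desc string) (ParsedGPU, error) {
	m := gpuDescription.FindStringSubmatch(strings.TrimSpace(desc))
	if m == nil {
		return ParsedGPU{}, fmt.Errorf("unrecognized GPU description %q", desc)
	}
	count, err := strconv.ParseInt(m[1], 10, 32)
	if err != nil {
		return ParsedGPU{}, fmt.Errorf("parse GPU count: %w", err)
	}
	memory, err := strconv.ParseInt(m[3], 10, 64)
	if err != nil {
		return ParsedGPU{}, fmt.Errorf("parse GPU memory: %w", err)
	}
	return ParsedGPU{
		Count:          int32(count),
		Name:           m[2],
		MemoryGiB:      memory,
		NetworkDetails: strings.TrimSpace(m[4]),
	}, nil
}

func main() {
	gpu, err := ParseGPUDescription("8x A100 (40 GB SXM4)")
	fmt.Println(gpu, err)
	_, err = ParseGPUDescription("eight A100s")
	fmt.Println(err)
}
```

Returning an error for unrecognized descriptions, rather than a zero-valued GPU, makes a format change in the upstream API visible during sync.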

Launchpad (cloud/v1/providers/launchpad/instancetype.go:launchpadGpusToGpus)

Maps structured API fields:

gpus[i] = v1.GPU{
    Name:           strings.ToUpper(gp.Family),
    Count:          gp.Count,
    MemoryBytes:    v1.NewBytes(v1.BytesValue(gp.MemoryGb), v1.Gigabyte),
    NetworkDetails: string(gp.InterconnectionType),
    Manufacturer:   v1.GetManufacturer(gp.Manufacturer),
}

Key Points

  • Name: base model only ("H100" not "NVIDIA H100 80GB")
  • NetworkDetails: "SXM", "SXM4", "SXM5", or "PCIe"
  • Manufacturer: always set it; use the constant matching the vendor (e.g., v1.ManufacturerNVIDIA)

6. Credential and Authentication Model

How Brev Authenticates to Your API

Brev stores credentials for your cloud provider and uses them to make API calls. This is a direct relationship between Brev's control plane and your cloud API.

What You Need to Provide

Requirement Details
API Credentials A JSON-serializable Go struct containing your authentication fields (API key, token, service account, etc.)
Authentication Endpoint How Brev authenticates (API key header, OAuth, etc.)

Credential Storage Model

Credentials are stored in Brev's control plane database as raw JSON (json.RawMessage). This means your credential struct must be JSON-serializable with proper struct tags.

How it works:

  1. You define a credential struct with JSON tags for each field
  2. Brev stores the struct as raw JSON bytes in the database (encrypted at rest)
  3. Brev deserializes the JSON back into your struct type when making API calls

Example credential struct:

type MyProviderCredential struct {
    RefID  string            // Set by Brev (the cloud_cred ID)
    APIKey string `json:"api_key"`
    Region string `json:"region,omitempty"`  // Optional fields use omitempty
}

Key requirements:

  • Give every field you need serialized an explicit json:"field_name" tag so the stored field names stay stable
  • The RefID field is set by Brev after storage (it's the database record ID)
  • Use json:"...,omitempty" for optional fields
  • The struct must implement the CloudCredential interface
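The storage round trip can be tried out with the standard library alone. In this sketch, `ExampleCredential`, `Encode`, and `Decode` are hypothetical names. One detail worth knowing: encoding/json serializes an untagged exported field under its Go field name, so this example tags RefID with `json:"-"` to keep it out of the stored JSON; whether your credential struct needs that tag depends on how Brev populates RefID, which this guide does not spell out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExampleCredential is a hypothetical credential struct. RefID is tagged
// `json:"-"` so encoding/json skips it; an untagged exported field would
// otherwise be serialized under its Go field name.
type ExampleCredential struct {
	RefID  string `json:"-"`
	APIKey string `json:"api_key"`
	Region string `json:"region,omitempty"`
}

// Encode returns the JSON that would be stored for a credential.
func Encode(c ExampleCredential) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

// Decode restores a credential from stored JSON and attaches the record ID.
func Decode(raw, refID string) (ExampleCredential, error) {
	var c ExampleCredential
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return ExampleCredential{}, err
	}
	c.RefID = refID
	return c, nil
}

func main() {
	stored, err := Encode(ExampleCredential{RefID: "cred-1", APIKey: "secret"})
	fmt.Println(stored, err)
	restored, err := Decode(stored, "cred-1")
	fmt.Println(restored.RefID, restored.APIKey, err)
}
```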

Credential Exchange Process

  1. You provide API credentials to Brev during integration setup
  2. Brev stores credentials securely (encrypted at rest)
  3. Brev uses credentials to call your API for sync and provisioning

Credential Types

Providers define their own credential struct with whatever fields they need. The struct fields use JSON tags that determine the field names in the stored JSON.

Provider Struct Fields JSON Fields
Lambda Labs APIKey string api_key
Shadeform APIKey string api_key
FluidStack APIKey string api_key
AWS AccessKeyID, SecretAccessKey access_key_id, secret_access_key
Nebius ServiceAccountKey, TenantID service_account_key, tenant_id
Launchpad APIToken, APIURL api_token, api_url

Complete credential struct example (from Launchpad):

type LaunchpadCredential struct {
    RefID    string            // Not serialized - set by Brev after storage
    APIToken string `json:"api_token"`
    APIURL   string `json:"api_url"`
}

var _ v1.CloudCredential = &LaunchpadCredential{}  // Compile-time interface check

func (c *LaunchpadCredential) Validate() error {
    return validation.ValidateStruct(c,
        validation.Field(&c.APIToken, validation.Required),
        validation.Field(&c.APIURL, validation.Required),
    )
}

Your credential struct must implement the CloudCredential interface, which requires these methods:

type CloudCredential interface {
    MakeClient(ctx context.Context, location string) (CloudClient, error)
    GetTenantID() (string, error)
    GetReferenceID() string
    GetAPIType() APIType
    GetCapabilities(ctx context.Context) (Capabilities, error)
    GetCloudProviderID() CloudProviderID
}

SSH Keys (Separate from API Credentials)

SSH keys are passed at instance creation time via the PublicKey field in CreateInstanceAttrs.

Your implementation must:

  1. Accept this public key in your create instance API
  2. Install it in the VM's default user ~/.ssh/authorized_keys before the instance becomes accessible

Brev manages SSH keys per user. The public key provided in CreateInstanceAttrs.PublicKey belongs to the user, and the control plane retains the corresponding private key to connect after creation.


7. Instance Lifecycle Operations

This section describes each lifecycle operation, its requirements, and expected behavior. Not all operations are required—providers declare their capabilities via GetCapabilities().

Lifecycle States

The SDK defines these states in LifecycleStatus (from cloud/v1/instance.go):

State Meaning
pending Create initiated, VM provisioning
running Instance is up with a public IP
stopping Stop requested, shutting down
stopped Powered off, storage preserved
suspending Suspend requested
suspended Hibernated state
terminating Terminate requested
terminated Instance destroyed
failed Provisioning or operation failed
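Most providers report their own status strings, so an implementation usually needs a translation step into these states. The sketch below assumes a made-up set of native statuses ("booting", "active", "off", and so on); `LifecycleStatus`, the constants, and `MapNativeStatus` are local stand-ins, not the SDK's definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// LifecycleStatus and its values are local stand-ins for the SDK's states.
type LifecycleStatus string

const (
	StatusPending     LifecycleStatus = "pending"
	StatusRunning     LifecycleStatus = "running"
	StatusStopping    LifecycleStatus = "stopping"
	StatusStopped     LifecycleStatus = "stopped"
	StatusTerminating LifecycleStatus = "terminating"
	StatusTerminated  LifecycleStatus = "terminated"
	StatusFailed      LifecycleStatus = "failed"
)

// MapNativeStatus translates a hypothetical provider status string.
// The second result is false when the native status is unrecognized, so the
// caller can decide how to report it instead of guessing a state.
func MapNativeStatus(native string) (LifecycleStatus, bool) {
	switch strings.ToLower(strings.TrimSpace(native)) {
	case "booting", "provisioning":
		return StatusPending, true
	case "active":
		return StatusRunning, true
	case "shutting_down":
		return StatusStopping, true
	case "off":
		return StatusStopped, true
	case "deleting":
		return StatusTerminating, true
	case "deleted":
		return StatusTerminated, true
	case "error":
		return StatusFailed, true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{"ACTIVE", "booting", "mystery"} {
		status, ok := MapNativeStatus(s)
		fmt.Printf("%s -> %q (known=%t)\n", s, status, ok)
	}
}
```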

Create Instance (Required)

Interface: CloudCreateTerminateInstance.CreateInstance(ctx, CreateInstanceAttrs) (*Instance, error)

Contract:

  • On success: Return an *Instance with a valid CloudID. The instance must exist in your system.
  • On error: Return an error and ensure no instance was created. Brev will not attempt cleanup on errors.
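The contract above can be illustrated against an in-memory fake. Everything here is hypothetical: `fakeCloud` stands in for a provider API, the `failBoot` flag simulates a failure after the VM record exists, and returning the existing instance for a repeated RefID is one possible way to use RefID for idempotency.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCloud is an in-memory stand-in for a provider API.
type fakeCloud struct {
	nextID    int
	byCloudID map[string]string // CloudID -> RefID
	failBoot  bool              // simulate a failure after the VM record exists
}

// findByRefID looks up an existing instance created with the same RefID.
func (f *fakeCloud) findByRefID(refID string) (string, bool) {
	for cloudID, ref := range f.byCloudID {
		if ref == refID {
			return cloudID, true
		}
	}
	return "", false
}

// CreateInstance returns the existing CloudID when refID was already used,
// and deletes any partially created instance before returning an error.
func (f *fakeCloud) CreateInstance(refID string) (string, error) {
	if id, ok := f.findByRefID(refID); ok {
		return id, nil // same RefID: return the existing instance
	}
	f.nextID++
	cloudID := fmt.Sprintf("i-%d", f.nextID)
	f.byCloudID[cloudID] = refID
	if f.failBoot {
		delete(f.byCloudID, cloudID) // roll back so nothing is left behind
		return "", errors.New("boot failed")
	}
	return cloudID, nil
}

func main() {
	cloud := &fakeCloud{byCloudID: map[string]string{}}
	first, _ := cloud.CreateInstance("ref-1")
	second, _ := cloud.CreateInstance("ref-1")
	fmt.Println(first == second, len(cloud.byCloudID))

	cloud.failBoot = true
	_, err := cloud.CreateInstance("ref-2")
	fmt.Println(err, len(cloud.byCloudID))
}
```

The rollback matters because Brev does not clean up after a failed create, so any instance left behind would keep running untracked.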

Key input fields from CreateInstanceAttrs:

| Field | Type | Required | Description |
|---|---|---|---|
| RefID | string | Yes | Brev's reference ID; use for idempotency |
| InstanceType | string | Yes | Your instance type name |
| Location | string | Yes | Region to launch in |
| SubLocation | string | No | Specific availability zone |
| PublicKey | string | Yes | SSH public key (OpenSSH format) |
| Name | string | No | Display name for the instance |
| ImageID | string | No | OS image; use your default if empty |
| DiskSize | units.Base2Bytes | No | Boot disk size |
| FirewallRules | FirewallRules | No | Ports to open (SSH port is always required) |
| Tags | Tags | No | Key-value metadata |
| UserDataBase64 | string | No | Cloud-init or startup script |

Key output fields on Instance:

| Field | When Required | Description |
|---|---|---|
| CloudID | Always | Your unique instance identifier |
| Status.LifecycleStatus | Always | Current state (pending or running) |
| Location | Always | Region where launched |
| InstanceType | Always | Instance type that was provisioned |
| PublicIP | When running | Public IPv4 for SSH access |
| SSHUser | Always | Username for SSH (e.g., ubuntu, root) |
| SSHPort | Always | SSH port (typically 22) |
| RefID | Always | Echo back the input RefID |

Example flow (from Lambda Labs implementation):

// 1. Register the SSH key with your API
keyPairResp, err := c.addSSHKey(ctx, openapi.AddSSHKeyRequest{
    Name:      attrs.RefID,
    PublicKey: &attrs.PublicKey,
})

// 2. Launch the instance with the key
resp, err := c.launchInstance(ctx, openapi.LaunchInstanceRequest{
    RegionName:       attrs.Location,
    InstanceTypeName: attrs.InstanceType,
    SshKeyNames:      []string{keyPairName},
})

// 3. Return instance details
return c.GetInstance(ctx, v1.CloudProviderInstanceID(resp.Data.InstanceIds[0]))

Terminate Instance (Required)

Interface: CloudCreateTerminateInstance.TerminateInstance(ctx, instanceID) error

Contract:

  • Initiate instance termination. Storage may or may not be preserved (provider-dependent).
  • Return nil on success, even if the instance is already terminated.
  • The instance should eventually reach terminated state.

Idempotency: Should succeed if called multiple times on the same instance.
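A common way to get this behavior is to treat "not found" from the provider as success. The sketch below uses a local `ErrResourceNotFound` as a stand-in for the SDK error of the same name, and `deleteFunc` is a hypothetical wrapper around a provider's delete endpoint.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrResourceNotFound is a local stand-in for the SDK error of the same name.
var ErrResourceNotFound = errors.New("resource not found")

// deleteFunc is a hypothetical call into a provider's delete endpoint.
type deleteFunc func(instanceID string) error

// Terminate treats "already gone" as success so repeated calls return nil.
// Any other error is wrapped and returned to the caller.
func Terminate(del deleteFunc, instanceID string) error {
	err := del(instanceID)
	if err == nil || errors.Is(err, ErrResourceNotFound) {
		return nil
	}
	return fmt.Errorf("terminate %s: %w", instanceID, err)
}

func main() {
	live := map[string]bool{"i-1": true}
	del := func(id string) error {
		if !live[id] {
			return ErrResourceNotFound
		}
		delete(live, id)
		return nil
	}
	fmt.Println(Terminate(del, "i-1")) // first call deletes the instance
	fmt.Println(Terminate(del, "i-1")) // second call finds nothing to delete
}
```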

Stop Instance (Optional)

Capability: CapabilityStopStartInstance

Interface: CloudStopStartInstance.StopInstance(ctx, instanceID) error

Contract:

  • Power off the instance while preserving storage.
  • Return nil once the stop operation is initiated.
  • Instance should transition: running → stopping → stopped

When to implement: Only if your platform supports instances that can stop and preserve storage. Lambda Labs does not support this, but Nebius does.

Start Instance (Optional)

Capability: CapabilityStopStartInstance

Interface: CloudStopStartInstance.StartInstance(ctx, instanceID) error

Contract:

  • Power on a previously stopped instance.
  • Return nil once the start operation is initiated.
  • Instance should transition: stopped → pending → running

Note: If you implement StopInstance, you must also implement StartInstance.

Stop/Start: Three Levels of Control

Stop/start support is controlled at three levels:

| Level | What to Set | Purpose |
|---|---|---|
| Provider Capability | CapabilityStopStartInstance in GetCapabilities() | Indicates your API supports stop/start operations |
| Instance Type | InstanceType.Stoppable = true/false | Indicates whether this instance type can be stopped (e.g., spot instances typically cannot) |
| Instance | Instance.Stoppable = true/false | Indicates whether this specific instance can be stopped |

Example - Nebius (supports stop/start):

// In GetCapabilities()
v1.CapabilityStopStartInstance,  // API supports it

// In GetInstanceTypes() - instance type level
instanceType := v1.InstanceType{
    Stoppable: true,  // This type supports stop/start
    // ...
}

// In GetInstance()/CreateInstance() - instance level
instance := v1.Instance{
    Stoppable: true,  // This instance can be stopped
    // ...
}

Example - Lambda Labs (no stop/start support):

// In GetCapabilities()
// CapabilityStopStartInstance NOT included

// In GetInstanceTypes()
instanceType := v1.InstanceType{
    Stoppable: false,  // Cannot be stopped
    // ...
}

// In GetInstance()/CreateInstance()
instance := v1.Instance{
    Stoppable: false,  // Cannot be stopped
    // ...
}

The control plane checks all three levels before allowing a stop/start operation. If any level indicates false, the operation won't be attempted.
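The three-level check amounts to a logical AND. The following sketch shows that logic with local stand-ins: `Capability`, the constant's string value, and `CanStop` are assumptions for illustration rather than the control plane's actual code.

```go
package main

import "fmt"

// Capability and the constant below are local stand-ins; the string value
// is an assumption for this sketch.
type Capability string

const CapabilityStopStartInstance Capability = "stop-start-instance"

// CanStop reports whether a stop/start operation would be attempted:
// all three levels must agree.
func CanStop(caps []Capability, typeStoppable, instanceStoppable bool) bool {
	hasCapability := false
	for _, c := range caps {
		if c == CapabilityStopStartInstance {
			hasCapability = true
			break
		}
	}
	return hasCapability && typeStoppable && instanceStoppable
}

func main() {
	withStop := []Capability{CapabilityStopStartInstance}
	fmt.Println(CanStop(withStop, true, true))  // all three levels allow it
	fmt.Println(CanStop(withStop, true, false)) // this instance opts out
	fmt.Println(CanStop(nil, true, true))       // provider lacks the capability
}
```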

Get Instance (Required)

Interface: CloudInstanceReader.GetInstance(ctx, instanceID) (*Instance, error)

Contract:

  • Return current state of the instance.
  • Return ErrResourceNotFound if the instance doesn't exist.

List Instances (Required)

Interface: CloudInstanceReader.ListInstances(ctx, ListInstancesArgs) ([]Instance, error)

Contract:

  • Return all instances matching the filter criteria.
  • Used by the Instance Syncer to reconcile state (called every ~5 seconds).

Capability Declaration

Your credential's GetCapabilities() must return the capabilities you support:

func (c *MyCredential) GetCapabilities(ctx context.Context) (v1.Capabilities, error) {
    return v1.Capabilities{
        v1.CapabilityCreateInstance,           // Required
        v1.CapabilityTerminateInstance,        // Required
        v1.CapabilityCreateTerminateInstance,  // Required (composite)
        // Optional:
        v1.CapabilityStopStartInstance,        // If you support stop/start
        v1.CapabilityRebootInstance,           // If you support reboot
        v1.CapabilityTags,                     // If your API supports instance tags/labels (see Section 10)
        v1.CapabilityModifyFirewall,           // If you support dynamic firewall rules
        v1.CapabilityResizeInstanceVolume,     // If you support volume resizing
    }, nil
}

Brev checks capabilities before calling optional methods. If you don't declare a capability, Brev won't attempt that operation.

Note on CapabilityTags: This capability is optional, but RefID and CloudCredRefID data is required regardless. If your API doesn't support tags, you must use an alternative mechanism to store and retrieve this data. See Section 10: Instance Metadata and Tags for details and examples.


8. SSH Connectivity

Core Requirement

Brev's control plane must be able to connect to your instances via SSH using the provided keys. This is the only hard requirement for network connectivity.

After your VM is running, Brev connects via SSH to:

  1. Configure the environment: Install Brev agent, set up development tools
  2. Enable connections: Set up tunnels and connection paths for users
  3. Manage instance: Execute commands, transfer files, health checks

What Brev Provides at Launch

When provisioning, we pass:

  • SSH public key: Key to install in authorized_keys (via CreateInstanceAttrs.PublicKey)
  • Firewall rules: Ports to open (see Section 9)

Instance Requirements

Your instances must return these fields so Brev can connect:

| Field | Required | Description |
|-------|----------|-------------|
| SSHUser | Yes | Username for SSH (e.g., ubuntu, root, ec2-user) |
| SSHPort | Yes | SSH port (commonly 22, but can be any port) |
| PublicIP | Yes | Publicly routable address for SSH connection |

Note: While PublicIP is the required field, public routing via DNS also works in practice. The key requirement is that Brev can reach your instance over SSH.

SSH User

Brev connects as the default user your image provides:

| Image | Default User |
|-------|--------------|
| Ubuntu | ubuntu |
| Debian | admin or debian |
| Amazon Linux | ec2-user |
| Custom | Whatever you configure |

Runtime Requirements

| Requirement | Details |
|-------------|---------|
| SSHD running | On the port specified by Instance.SSHPort |
| Port publicly reachable | No NAT or firewall blocking inbound SSH |
| Key installed | The public key from CreateInstanceAttrs.PublicKey in authorized_keys |
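Before running the validation suite, you may want a quick self-check that an instance meets the first two runtime requirements. The sketch below is not part of the SDK: `probeSSH` is a hypothetical helper that opens a TCP connection and checks for an SSH identification string. It assumes the server sends its version line first, which is typical for OpenSSH, and it does not verify that the key is installed.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strconv"
	"strings"
	"time"
)

// probeSSH dials host:port and reports whether the first line the server
// sends starts with "SSH-" (the SSH identification string) within timeout.
func probeSSH(host string, port int, timeout time.Duration) (bool, error) {
	addr := net.JoinHostPort(host, strconv.Itoa(port))
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false, fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()

	// Bound the banner read so a silent listener cannot hang the probe.
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false, err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false, fmt.Errorf("read banner from %s: %w", addr, err)
	}
	return strings.HasPrefix(line, "SSH-"), nil
}

func main() {
	// Replace with the PublicIP and SSHPort your instance reports.
	ok, err := probeSSH("127.0.0.1", 22, 3*time.Second)
	fmt.Println(ok, err)
}
```

Run the probe from a machine outside your network, since the requirement is reachability from Brev's control plane rather than from inside your VPC.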

9. Firewall and Security Groups

Can you dynamically expose ports at instance creation? Yes, if you support user-data or have a native firewall API.

Can you modify firewall rules after creation without SSH/reboot? Only if you have a native API. Most GPU clouds don't.

SDK Structures

type FirewallRules struct {
    IngressRules []FirewallRule
    EgressRules  []FirewallRule
}

type FirewallRule struct {
    FromPort int32
    ToPort   int32
    IPRanges []string // CIDR notation
}

Passed via CreateInstanceAttrs.FirewallRules.
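Because these rules may end up in API calls or in shell commands run at boot, it can be worth validating them before use. The sketch below copies the two structs from above and adds `validateRule`, a hypothetical helper that is not part of the SDK. The port bounds and the "empty IPRanges means any source" reading are assumptions based on the UFW pattern later in this section.

```go
package main

import (
	"fmt"
	"net"
)

// FirewallRule and FirewallRules are copied from the SDK structures above.
type FirewallRule struct {
	FromPort int32
	ToPort   int32
	IPRanges []string // CIDR notation
}

type FirewallRules struct {
	IngressRules []FirewallRule
	EgressRules  []FirewallRule
}

// validateRule (hypothetical) rejects inverted or out-of-range ports and
// any IPRanges entry that does not parse as a CIDR.
func validateRule(r FirewallRule) error {
	if r.FromPort < 1 || r.ToPort > 65535 || r.FromPort > r.ToPort {
		return fmt.Errorf("invalid port range %d-%d", r.FromPort, r.ToPort)
	}
	for _, cidr := range r.IPRanges {
		if _, _, err := net.ParseCIDR(cidr); err != nil {
			return fmt.Errorf("invalid CIDR %q: %w", cidr, err)
		}
	}
	return nil
}

func main() {
	rules := FirewallRules{
		IngressRules: []FirewallRule{
			{FromPort: 22, ToPort: 22, IPRanges: []string{"203.0.113.0/24"}},
			{FromPort: 8000, ToPort: 8100}, // no IPRanges: treated as any source
		},
	}
	for _, r := range rules.IngressRules {
		fmt.Println(validateRule(r))
	}
}
```

Rejecting unparseable CIDR strings early also keeps arbitrary text out of any generated firewall script.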

If You Have a Native API

Use it. Implement CloudModifyFirewall for post-creation changes:

type CloudModifyFirewall interface {
    AddFirewallRulesToInstance(ctx context.Context, args AddFirewallRulesToInstanceArgs) error
    RevokeSecurityGroupRules(ctx context.Context, args RevokeSecurityGroupRuleArgs) error
}

Add CapabilityModifyFirewall to your capabilities.

If You Only Have User-Data

Inject UFW + iptables commands at boot. Reference implementation: cloud/v1/providers/shadeform/firewall.go.

UFW Commands - Host-level firewall:

// Core UFW pattern
commands := []string{
    "ufw --force reset",          // Reset to clean state
    "ufw default deny incoming",  // Default deny incoming
    "ufw default allow outgoing", // Default allow outgoing
    "ufw allow 22/tcp",           // Always allow SSH
    "ufw allow 2222/tcp",         // Allow alternate SSH port
}

// Add ingress rules
for _, rule := range firewallRules.IngressRules {
    if rule.FromPort == rule.ToPort {
        // Single port
        if len(rule.IPRanges) == 0 {
            commands = append(commands, fmt.Sprintf("ufw allow in from any to any port %d", rule.FromPort))
        } else {
            for _, cidr := range rule.IPRanges {
                commands = append(commands, fmt.Sprintf("ufw allow in from %s to any port %d", cidr, rule.FromPort))
            }
        }
    } else {
        // Port ranges require separate tcp/udp rules
        for _, proto := range []string{"tcp", "udp"} {
            portSpec := fmt.Sprintf("port %d:%d proto %s", rule.FromPort, rule.ToPort, proto)
            if len(rule.IPRanges) == 0 {
                commands = append(commands, fmt.Sprintf("ufw allow in from any to any %s", portSpec))
            } else {
                for _, cidr := range rule.IPRanges {
                    commands = append(commands, fmt.Sprintf("ufw allow in from %s to any %s", cidr, portSpec))
                }
            }
        }
    }
}

// Add egress rules (same pattern as ingress but with "ufw allow out to")
// ...

commands = append(commands, "ufw --force enable")
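The egress half is elided in the snippet above. One way it might look, mirroring the ingress pattern with "ufw allow out to", is sketched below. `egressCommands` is a hypothetical helper and may differ from the Shadeform reference implementation; treating empty IPRanges as "any" destination is an assumption carried over from the ingress code.

```go
package main

import "fmt"

// FirewallRule is copied from the SDK structures in this section.
type FirewallRule struct {
	FromPort int32
	ToPort   int32
	IPRanges []string // CIDR notation
}

// egressCommands (hypothetical) builds "ufw allow out to ..." commands.
// Single ports get one rule per destination; port ranges need separate
// tcp and udp rules, as with ingress.
func egressCommands(rules []FirewallRule) []string {
	var cmds []string
	for _, rule := range rules {
		dests := rule.IPRanges
		if len(dests) == 0 {
			dests = []string{"any"} // assumption: no ranges means any destination
		}
		for _, dest := range dests {
			if rule.FromPort == rule.ToPort {
				cmds = append(cmds, fmt.Sprintf("ufw allow out to %s port %d", dest, rule.FromPort))
				continue
			}
			for _, proto := range []string{"tcp", "udp"} {
				cmds = append(cmds, fmt.Sprintf("ufw allow out to %s port %d:%d proto %s",
					dest, rule.FromPort, rule.ToPort, proto))
			}
		}
	}
	return cmds
}

func main() {
	for _, cmd := range egressCommands([]FirewallRule{
		{FromPort: 443, ToPort: 443},
		{FromPort: 8000, ToPort: 8100, IPRanges: []string{"10.0.0.0/8"}},
	}) {
		fmt.Println(cmd)
	}
}
```

With "ufw default allow outgoing" in effect, explicit egress allows are redundant; they matter only if you tighten the outgoing default.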

IPTables Commands - Block Docker from bypassing UFW:

Docker manipulates iptables directly, bypassing UFW. The DOCKER-USER chain is the official hook point for custom rules that Docker respects.

// Required iptables commands to secure Docker containers
iptablesCommands := []string{
    "iptables -F DOCKER-USER",                                              // Reset chain
    "iptables -A DOCKER-USER -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT", // Allow responses
    "iptables -A DOCKER-USER -i lo -j ACCEPT",                              // Allow loopback
    "iptables -A DOCKER-USER -j DROP",                                      // Drop all other inbound
    "iptables -A DOCKER-USER -j RETURN",                                    // Required by Docker
}

Without these iptables rules, a Docker container listening on 0.0.0.0:PORT would be accessible from the internet even if UFW blocks that port.

Full script generation:

// Combine UFW + iptables commands ("commands" is the UFW slice built above)
allCommands := append(commands, iptablesCommands...)

script := ""
for _, cmd := range allCommands {
    script += fmt.Sprintf("%v\n", cmd)
}

encoded := base64.StdEncoding.EncodeToString([]byte(script))

Validation Tests:

The SDK validates firewall behavior with two tests in cloud/internal/validation/suite.go:

  • ValidateFirewallBlocksPort - Verifies UFW blocks non-allowed ports (tests port 9999 by default)
  • ValidateDockerFirewallBlocksPort - Verifies iptables DOCKER-USER chain blocks Docker container ports

These run as part of RunInstanceLifecycleValidation and RunFirewallValidation. See cloud/v1/networking_validation.go for the test implementations.

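With the user-data approach, rules are fixed at boot, so do not implement CloudModifyFirewall or declare CapabilityModifyFirewall. Return ErrNotImplemented from its methods.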

If You Only Have IP Allowlists

See cloud/v1/providers/launchpad/instance_create.go. You can only restrict by source IP, not port. Extract /32s from the rules and pass to your API:

ips := []string{}
for _, rule := range firewallRules.IngressRules {
    for _, cidr := range rule.IPRanges {
        _, ipNet, err := net.ParseCIDR(cidr)
        if err != nil {
            continue // skip malformed CIDRs; ipNet is nil when parsing fails
        }
        ones, bits := ipNet.Mask.Size()
        if ones == bits { // single-host CIDR only (/32 for IPv4)
            ips = append(ips, ipNet.IP.String())
        }
    }
}

10. Instance Metadata and Tags

Brev uses metadata to track and correlate instances. The control plane requires certain data to be persisted with instances and retrievable later.

Required Instance Data

These values MUST be stored with the instance and returned in GetInstance/ListInstances:

| Field | Purpose |
|-------|---------|
| RefID | Instance correlation and idempotency (passed in CreateInstanceAttrs.RefID) |
| CloudCredRefID | Identifies which credential created the instance (from GetReferenceID()) |

The CapabilityTags Capability

If your cloud provider's API supports instance tagging/labeling, declare v1.CapabilityTags in your capabilities:

func (c *MyCredential) GetCapabilities(ctx context.Context) (v1.Capabilities, error) {
    return v1.Capabilities{
        v1.CapabilityCreateInstance,
        v1.CapabilityTerminateInstance,
        v1.CapabilityTags,  // Declare this if your API supports tags/labels
    }, nil
}

When CapabilityTags is declared:

  • Store RefID, CloudCredRefID, and any additional tags via CreateInstanceAttrs.Tags
  • The control plane will call UpdateInstanceTags() to add metadata after creation
  • ListInstances() should support filtering via TagFilters for efficient queries

Example (Shadeform with tags):

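// At creation - store RefID and CloudCredRefID as "key=value" tags
refIDTag := fmt.Sprintf("refID=%s", attrs.RefID)
cloudCredRefIDTag := fmt.Sprintf("cloudCredRefID=%s", c.GetReferenceID())
tags := []string{refIDTag, cloudCredRefIDTag}

// When reading back - split each "key=value" tag, then look up by key
parsed := map[string]string{}
for _, tag := range tags {
    if key, value, ok := strings.Cut(tag, "="); ok {
        parsed[key] = value
    }
}
refID := parsed["refID"]
cloudCredRefID := parsed["cloudCredRefID"]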

Alternative: When Tags Are NOT Supported

If your API doesn't support tags, you still must persist and return RefID and CloudCredRefID. Use creative alternatives:

Example (Lambda Labs without tags):

// At creation - encode CloudCredRefID in instance name
name := fmt.Sprintf("%s--%s", c.GetReferenceID(), time.Now().UTC().Format(timeFormat))
// Use RefID as the SSH key pair name
keyPairName := attrs.RefID

// When reading back - extract from name and SSH key
nameParts := strings.Split(instance.Name, "--")
cloudCredRefID := nameParts[0]
refID := instance.SshKeyNames[0]
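If you encode metadata in the instance name as above, it helps to guard both directions so that manually created instances or unexpected names do not produce a wrong CloudCredRefID. The sketch below is illustrative only: `encodeName` and `decodeName` are hypothetical helpers, and the timestamp layout is an assumption, since `timeFormat` is not shown in the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

const nameSep = "--"

// encodeName packs the credential reference ID and a UTC timestamp into an
// instance name. It rejects IDs that are empty or contain the separator,
// because either would make decoding ambiguous.
func encodeName(cloudCredRefID string, now time.Time) (string, error) {
	if cloudCredRefID == "" || strings.Contains(cloudCredRefID, nameSep) {
		return "", fmt.Errorf("cloudCredRefID %q is empty or contains %q", cloudCredRefID, nameSep)
	}
	// The layout is an assumed stand-in for the example's timeFormat.
	return cloudCredRefID + nameSep + now.UTC().Format("20060102-150405"), nil
}

// decodeName recovers the credential reference ID, returning an error for
// names that were not produced by encodeName.
func decodeName(name string) (string, error) {
	id, _, found := strings.Cut(name, nameSep)
	if !found || id == "" {
		return "", errors.New("instance name does not carry a cloudCredRefID")
	}
	return id, nil
}

func main() {
	name, err := encodeName("cred-123", time.Date(2025, 1, 2, 3, 4, 5, 0, time.UTC))
	fmt.Println(name, err) // cred-123--20250102-030405 <nil>

	id, err := decodeName(name)
	fmt.Println(id, err) // cred-123 <nil>
}
```

Returning an error instead of indexing blindly lets ListInstances skip instances that do not belong to Brev rather than misattributing them.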

Recommendation: Use Tags If Possible

Tags are the recommended and easiest integration path. They provide:

  • Clean separation of metadata from instance properties
  • Efficient server-side filtering via TagFilters
  • Full billing/usage tracking capabilities
  • Straightforward implementation

If your cloud API supports any form of instance tagging, labels, or metadata—use it.

Before implementing a custom solution, please reach out to the Brev team. We can help design an approach that works reliably with the control plane and avoids edge cases that could cause instance correlation issues.


11. Error Handling and Status Reporting

Error Categories

Your provider implementation should translate API errors into the standard error constants defined in v1/errors.go:

| Category | Examples | Return This Error Constant |
|----------|----------|----------------------------|
| Out of Stock | No capacity in region | v1.ErrInsufficientResources |
| Quota Exceeded | Hit account limit | v1.ErrOutOfQuota |
| Resource Not Found | Instance/image doesn't exist | v1.ErrResourceNotFound, v1.ErrInstanceNotFound, v1.ErrImageNotFound |
| Service Unavailable | API temporarily down | v1.ErrServiceUnavailable |
| Auth Failed | Bad API key | Return HTTP 401/403 error |
| Internal Error | Your system issue | Return error with HTTP 500 details |

Reference: See v1/errors.go for the full list of error constants:

var (
	ErrInsufficientResources = errors.New("zone has insufficient resources to fulfill the request, InsufficientCapacity")
	ErrOutOfQuota            = errors.New("out of quota in the region fulfill the request, InsufficientQuota")
	ErrImageNotFound         = errors.New("image not found")
	ErrDuplicateFirewallRule = errors.New("duplicate firewall rule")
	ErrInstanceNotFound      = errors.New("instance not found")
	ErrResourceNotFound      = errors.New("resource not found")
	ErrServiceUnavailable    = errors.New("api is temporarily unavailable")
)

Out of Stock Handling

"Out of stock" is common with GPUs. Your implementation should return v1.ErrInsufficientResources:

  1. Your API returns your specific "no capacity" error
  2. Your provider translates this to v1.ErrInsufficientResources
  3. Brev marks that type as temporarily unavailable in that region
  4. The syncer will re-check availability on the next poll

Example from Shadeform provider (v1/providers/shadeform/instance.go):

if shadeformErrorResponse.ErrorCode == outOfStockErrorCode {
    return v1.ErrInsufficientResources
}

Example from Lambda Labs provider (v1/providers/lambdalabs/errors.go):

if strings.Contains(e.Error(), "Not enough capacity") || strings.Contains(e.Error(), "insufficient-capacity") {
    return v1.ErrInsufficientResources
}
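If your API returns structured error codes, the translation can live in one function. The sketch below is an assumption-heavy illustration: `apiError`, its code strings, and `translate` are hypothetical, and the sentinels are local copies with shortened messages. It wraps the sentinel with `%w`, which only works if the caller matches with `errors.Is`. The provider examples above return the bare sentinel, which is the safer choice if you are unsure how the error is compared.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of sentinels from the v1/errors.go excerpt (messages shortened).
var (
	ErrInsufficientResources = errors.New("insufficient resources")
	ErrOutOfQuota            = errors.New("out of quota")
	ErrServiceUnavailable    = errors.New("api is temporarily unavailable")
)

// apiError is a hypothetical shape for a provider's error response.
type apiError struct {
	Code    string
	Message string
}

func (e *apiError) Error() string { return e.Code + ": " + e.Message }

// translate maps hypothetical provider codes to the sentinels, keeping the
// provider's message as context. Unknown errors pass through unchanged.
func translate(err error) error {
	var ae *apiError
	if !errors.As(err, &ae) {
		return err
	}
	switch ae.Code {
	case "out_of_stock":
		return fmt.Errorf("%w: %s", ErrInsufficientResources, ae.Message)
	case "quota_exceeded":
		return fmt.Errorf("%w: %s", ErrOutOfQuota, ae.Message)
	case "unavailable":
		return fmt.Errorf("%w: %s", ErrServiceUnavailable, ae.Message)
	}
	return err
}

// isOutOfStock shows how a caller could match the wrapped sentinel.
func isOutOfStock(err error) bool { return errors.Is(err, ErrInsufficientResources) }

func main() {
	err := translate(&apiError{Code: "out_of_stock", Message: "no H100 capacity in us-east"})
	fmt.Println(isOutOfStock(err)) // true
	fmt.Println(err)
}
```

Matching on error codes is generally sturdier than matching on message substrings, which can change without notice.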

12. Billing and Pricing

Billing

Billing arrangements are handled separately during the integration partnership setup. In most cases, this simply means Brev creates an account on your cloud platform with a credit card on file. There is no special billing integration or reconciliation process required.

How Pricing Works

Brev displays your prices via InstanceType.BasePrice (see v1/instancetype.go).

| Field | Type | Notes |
|-------|------|-------|
| BasePrice | *currency.Amount | From github.com/bojanz/currency |
| Currency | Up to implementer | Most providers use "USD" |

13. Common Questions

"Do you need access to our admin console?"

No. We only need programmatic API access. All operations go through your public API—see Section 6 for credential details.

"What images/OS should our VMs run?"

| Requirement | Details |
|-------------|---------|
| OS | Ubuntu 22.04 (preferred) or 24.04 |

Custom images work if they meet these requirements. The SDK validates image compatibility via ValidateInstanceImage().

"What if we don't have public IPs?"

Public IP with SSH access is required for standard integration. Bastion/jump host routing is supported (see InternalPortMappings in the Instance struct). Other alternatives (VPN, Cloudflare tunnels) require custom integration work.

"How do you track GPU interconnect (NVLink, SXM, PCIe)?"

We track interconnect type via the GPU.NetworkDetails field. Your implementation should populate this with values like "PCIe", "SXM", "SXM4", or "SXM5". If you have multiple variants (e.g., PCIe vs SXM versions of the same GPU), surface them as separate instance types.

"What SLA/uptime do you expect from our API?"

| Requirement | Target |
|-------------|--------|
| Availability | 99%+ uptime |
| Response time | < 5 seconds typical |
| Idempotency | Supported where possible |

The Instance Syncer is resilient to brief outages—it retries and recovers automatically.

"What does Brev do on the VMs after launch?"

After CreateInstance returns successfully:

  1. SSH connection: Brev waits for SSH to become available (up to 10 minutes via ValidateInstanceSSHAccessible)
  2. Key bootstrapping: Brev adds admin keys to authorized_keys via SSH
  3. Agent setup: Brev installs a lightweight agent for tunnel management and environment configuration

You don't need to do anything special—just ensure the SSH public key from CreateInstanceAttrs.PublicKey is installed before the instance becomes accessible.
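To see whether your instances become reachable well inside the 10-minute window, you can poll with a deadline during your own testing. The sketch below is not how `ValidateInstanceSSHAccessible` is implemented. `waitUntilReady` is a hypothetical helper, and the probe function is a stand-in for whatever reachability check you use.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntilReady calls probe immediately and then every interval until it
// returns true, giving up once another attempt would exceed the timeout.
func waitUntilReady(timeout, interval time.Duration, probe func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if probe() {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("not ready after %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in probe that succeeds on the third attempt; a real one would
	// try to reach the instance over SSH.
	err := waitUntilReady(10*time.Minute, 10*time.Millisecond, func() bool {
		attempts++
		return attempts >= 3
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```

If your instances routinely need most of the window, consider returning from CreateInstance only once the key is installed, or reporting a pending status until it is.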


Next Steps

To begin integration:

  1. Follow the Integration Guide and copy the template — Start with the Integration Guide, which walks through the v1 interfaces, directory layout, and a copy/paste scaffold. Use the Lambda Labs provider as your canonical reference.
  2. Implement your Cloud provider — Build out instance lifecycle, instance types, capabilities, and security conformance under internal/{provider}/v1/. Embed NotImplCloudClient for any unsupported operations.
  3. Run the local Validation Tests — Wire up validation_test.go using real credentials and run make test-validation locally. This exercises instance create/get/list/terminate, instance types, and capability checks against your live API.
  4. Provide Brev with a test account — Give Brev access to run validation independently. This typically means a console account or provided API credentials, but exact requirements vary by provider.
  5. Brev validates end-to-end flow — We run the full validation suite plus our internal end-to-end tests against your provider, confirm catalog readiness, and enable it in Brev.

See the Integration Guide for detailed implementation instructions, and reach out to the Brev team with any questions.


Document version: 2.0 For Brev integration partners