Brev Cloud Provider Integration Guide

For Cloud Infrastructure Providers Integrating with Brev

1. Integration Overview
2. How Brev Discovers Your Inventory
3. Instance Types: Your Compute Catalog
4. Location Model
5. GPU Normalization
6. Credential and Authentication Model
7. Instance Lifecycle Operations
8. SSH Connectivity
9. Firewall and Security Groups
10. Instance Metadata and Tags
11. Error Handling and Status Reporting
12. Billing and Pricing
13. Common Questions

1. Integration Overview

What Does Integration Mean?

When you integrate with Brev, you're allowing Brev's control plane to:

Sync your available GPU instance types into Brev's catalog
Provision instances on your infrastructure via API calls
Manage instance lifecycle (start, stop, terminate) through your API
Connect to running instances via SSH to configure them

What Brev Needs From You (Cloud Provider)

Requirement	Purpose
Instance Type Listing API	Discover your available instance types
Instance Lifecycle APIs	Create, get, start, stop, terminate
API Credentials for Brev	Authenticate Brev's calls to your API
SSH Key Injection	Accept SSH public key at VM creation
SSH Access	Control plane communication to VMs

Integration Architecture

System Architecture Diagram

┌────────────────────────────────────────────────────────────────────────────────────┐
│                              Brev Control Plane (dev-plane)                        │
│                                                                                    │
│  ┌──────────────────────────────────┐    ┌──────────────────────────────────────┐  │
│  │         Syncer Layer             │    │     Instance Service Layer           │  │
│  │    (Continuous Reconciliation)   │    │       (User-Triggered Actions)       │  │
│  │                                  │    │                                      │  │
│  │  ┌────────────────────────────┐  │    │  ┌────────────────────────────────┐  │  │
│  │  │  InstanceTypeSyncer        │  │    │  │  Instance Lifecycle            │  │  │
│  │  │  ─────────────────────     │  │    │  │  ─────────────────────         │  │  │
│  │  │  Calls:                    │  │    │  │  Calls:                        │  │  │
│  │  │  • GetInstanceTypes()      │  │    │  │  • CreateInstance()            │  │  │
│  │  │  • GetLocations()          │  │    │  │  • TerminateInstance()         │  │  │
│  │  │  • GetInstanceTypePollTime │  │    │  │  • StopInstance()              │  │  │
│  │  │                            │  │    │  │  • StartInstance()             │  │  │
│  │  │  Interval: 1-5 min         │  │    │  │                                │  │  │
│  │  └────────────┬───────────────┘  │    │  └──────────────┬─────────────────┘  │  │
│  │               │                  │    │                 │                    │  │
│  │  ┌────────────┴───────────────┐  │    │  ┌──────────────┴─────────────────┐  │  │
│  │  │  InstanceSyncer            │  │    │  │  Instance State & Queries      │  │  │
│  │  │  ─────────────────────     │  │    │  │  ─────────────────────         │  │  │
│  │  │  Calls:                    │  │    │  │  Calls:                        │  │  │
│  │  │  • ListInstances()         │  │    │  │  • GetInstance()               │  │  │
│  │  │                            │  │    │  │  • ListInstances()             │  │  │
│  │  │  Interval: 5 sec           │  │    │  │  • AddFirewallRulesToInstance  │  │  │
│  │  └────────────┬───────────────┘  │    │  │  • ResizeInstanceVolume()      │  │  │
│  │               │                  │    │  │  • UpdateInstanceTags()        │  │  │
│  └───────────────┼──────────────────┘    │  └──────────────┬─────────────────┘  │  │
│                  │                       └─────────────────┼────────────────────┘  │
│                  │                                         │                       │
└──────────────────┼─────────────────────────────────────────┼───────────────────────┘
                   │                                         │
                   │       ┌─────────────────────────────────┘
                   │       │
                   ▼       ▼
┌────────────────────────────────────────────────────────────────────────────────────┐
│                           CLOUD SDK (v1) - This Repo                               │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                         CloudClient Interface                                │  │
│  │  (Composed of: CloudCredential, CloudBase, CloudQuota, CloudStopStart,       │  │
│  │   CloudReboot, CloudResizeVolume, CloudModifyFirewall, CloudInstanceTags...) │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                        Provider Implementations                              │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐            │  │
│  │  │ Lambda   │ │ Fluidstk │ │ Shadefrm │ │  Nebius  │ │  Your    │            │  │
│  │  │ Labs     │ │          │ │          │ │          │ │ Provider │   • • •    │  │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘            │  │
│  └───────┼────────────┼────────────┼────────────┼────────────┼──────────────────┘  │
│          │            │            │            │            │                     │
└──────────┼────────────┼────────────┼────────────┼────────────┼─────────────────────┘
           │            │            │            │            │
           ▼            ▼            ▼            ▼            ▼
┌────────────────────────────────────────────────────────────────────────────────────┐
│                            CLOUD PROVIDER APIs                                     │
│                                                                                    │
│  Each provider's native REST/gRPC API for instance management                      │
└────────────────────────────────────────────────────────────────────────────────────┘

2. How Brev Discovers Your Inventory

The Instance Type Syncer

Brev runs a continuous synchronization process that periodically queries your API to understand what compute is available.

Sync Behavior:

Polls your instance type listing API at a configurable interval you define via GetInstanceTypePollTime() (default: 1 minute; existing implementations use 1-5 minutes depending on provider needs)
Compares current catalog to previous state
Updates availability, pricing, and specs as they change
Marks types as unavailable when removed from your API
Adds new types when they appear
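
The compare-and-update step above can be pictured with a small sketch. This is not Brev's syncer code: CatalogEntry and reconcile are hypothetical names, and the entry is a trimmed-down stand-in for v1.InstanceType that keeps only the fields needed to show the idea.

```go
package main

import "fmt"

// CatalogEntry is a hypothetical, trimmed-down stand-in for v1.InstanceType.
type CatalogEntry struct {
	ID          string
	IsAvailable bool
}

// reconcile merges a fresh listing into the previous catalog. Entries present
// in current are added or updated; entries missing from current are kept but
// marked unavailable.
func reconcile(previous, current []CatalogEntry) map[string]CatalogEntry {
	merged := make(map[string]CatalogEntry, len(previous)+len(current))
	for _, e := range previous {
		e.IsAvailable = false // treated as removed unless the new listing includes it
		merged[e.ID] = e
	}
	for _, e := range current {
		merged[e.ID] = e // new or refreshed entry replaces the stale one
	}
	return merged
}

func main() {
	previous := []CatalogEntry{
		{ID: "us-west-1-noSub-gpu_1x_a100", IsAvailable: true},
		{ID: "us-west-1-noSub-gpu_8x_h100", IsAvailable: true},
	}
	current := []CatalogEntry{
		{ID: "us-west-1-noSub-gpu_8x_h100", IsAvailable: true},
		{ID: "us-east-1-noSub-gpu_1x_l40s", IsAvailable: true},
	}
	merged := reconcile(previous, current)
	for _, e := range append(previous, current[1]) {
		fmt.Printf("%s available=%t\n", e.ID, merged[e.ID].IsAvailable)
	}
}
```

The practical consequence for providers: a type that disappears from your listing is treated as unavailable, so return every type on every call rather than only the ones that changed.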

What We Query

We need an API endpoint that returns your available instance types. For each type, we map your data to the v1.InstanceType struct (defined in cloud/v1/instancetype.go):

Core Instance Type Fields:

Struct Field	Type	Description	Example
`Type`	`string`	Your internal type name	`"gpu_1x_a100_80gb_sxm4"`
`Location`	`string`	Region identifier	`"us-west-1"`
`VCPU`	`int32`	vCPU count	`128`
`MemoryBytes`	`Bytes`	RAM (use `v1.NewBytes()`)	`v1.NewBytes(1024, v1.Gibibyte)`
`BasePrice`	`*currency.Amount`	Hourly price in USD	`currency.NewAmountFromInt64(3200, "USD")` (= $32.00/hr)
`IsAvailable`	`bool`	Currently launchable	`true`

GPU Details (SupportedGPUs []GPU):

Struct Field	Type	Description	Example
`Count`	`int32`	Number of GPUs	`8`
`Name`	`string`	GPU model name	`"A100"`
`MemoryBytes`	`Bytes`	VRAM per GPU	`v1.NewBytes(80, v1.Gibibyte)`
`NetworkDetails`	`string`	Interconnect type	`"SXM4"`, `"PCIe"`
`Manufacturer`	`Manufacturer`	GPU vendor	`v1.ManufacturerNVIDIA`

Storage Details (SupportedStorage []Storage):

Struct Field	Type	Description	Example
`SizeBytes`	`Bytes`	Disk size	`v1.NewBytes(2000, v1.Gibibyte)`
`Type`	`string`	Storage type	`"ssd"`, `"nvme"`
`PricePerGBHr`	`*currency.Amount`	Additional storage cost	`nil` (if included in base price)

Example: Converting Provider Data to v1.InstanceType

From Lambda Labs implementation (cloud/v1/providers/lambdalabs/instancetype.go):

it := v1.InstanceType{
    Location:      location,
    Type:          instType.Name,                                           // "gpu_1x_a100_80gb_sxm4"
    SupportedGPUs: []v1.GPU{{
        Count:       1,                                                     // one GPU, matching "gpu_1x_..."
        Name:        "A100",
        MemoryBytes: v1.NewBytes(80, v1.Gibibyte),
        NetworkDetails: "SXM4",
        Manufacturer: v1.ManufacturerNVIDIA,
    }},
    SupportedStorage: []v1.Storage{{
        Type:      "ssd",
        SizeBytes: v1.NewBytes(instType.Specs.StorageGib, v1.Gibibyte),
    }},
    VCPU:        instType.Specs.Vcpus,
    MemoryBytes: v1.NewBytes(instType.Specs.MemoryGib, v1.Gibibyte),
    BasePrice:   &amount,
    IsAvailable: isAvailable,
    Provider:    CloudProviderID,
    Cloud:       CloudProviderID,
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // Generate ID using helper (or set your own)

API Type Declaration

When implementing the Cloud SDK, you declare how Brev's control plane should query your integration via GetAPIType():

API Type	Meaning	Control Plane Behavior
`APITypeGlobal`	Your `GetInstanceTypes()` returns all regions in one call	Brev calls once with `locations = ["all"]`
`APITypeLocational`	Your `GetInstanceTypes()` is region-scoped	Brev iterates over `GetLocations()` results

You handle the mapping internally. The SDK doesn't call your API directly—your implementation does. Whether your cloud's native API is regional, global, or something else entirely, you write the conversion logic in GetInstanceTypes().

Example: Global API (Lambda Labs)

Lambda Labs' API returns all instance types with regional availability embedded. The SDK implementation fetches once and expands to per-region v1.InstanceType entries:

// Simplified from cloud/v1/providers/lambdalabs/instancetype.go
func (c *LambdaLabsClient) GetInstanceTypes(ctx context.Context, args v1.GetInstanceTypeArgs) ([]v1.InstanceType, error) {
    resp, err := c.client.InstanceTypes(ctx)  // Single API call returns all types
    if err != nil {
        return nil, err
    }

    // locations: the provider's regions, resolved earlier (elided here)
    var instanceTypes []v1.InstanceType

    // Expand each type to all its available regions
    for _, instType := range resp.Data {
        for _, region := range locations {
            isAvailable := slices.Contains(instType.RegionsWithCapacityAvailable, region.Name)
            instanceTypes = append(instanceTypes, convertToV1(region.Name, instType, isAvailable))
        }
    }
    return instanceTypes, nil
}

Example: Locational API (Nebius)

Nebius requires per-region quota checks. The SDK implementation iterates regions internally:

// Simplified from cloud/v1/providers/nebius/instancetype.go
func (c *NebiusClient) GetInstanceTypes(ctx context.Context, args v1.GetInstanceTypeArgs) ([]v1.InstanceType, error) {
    platforms, err := c.sdk.Compute().Platform().List(ctx, c.projectID)
    if err != nil {
        return nil, err
    }

    // locations and quotaMap are resolved earlier (elided here)
    var instanceTypes []v1.InstanceType
    for _, platform := range platforms {  // iteration simplified
        for _, location := range locations {
            // Check quota per-region
            isAvailable := c.checkQuotaAvailability(platform, location.Name, quotaMap)
            instanceTypes = append(instanceTypes, convertToV1(location.Name, platform, isAvailable))
        }
    }
    return instanceTypes, nil
}

Key point: You decide how to call your cloud's API. Brev only cares that GetInstanceTypes() returns properly formatted v1.InstanceType entries with accurate Location and IsAvailable fields.
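
To make the difference between the two API types concrete, here is a minimal sketch of the caller side as the table above describes it. It is an illustration under assumptions, not the control plane's code: the APIType string values, the listFunc type, and syncInstanceTypes are all hypothetical, and instance types are reduced to plain strings.

```go
package main

import "fmt"

// APIType mirrors the two modes described above; the string values are assumptions.
type APIType string

const (
	APITypeGlobal     APIType = "global"
	APITypeLocational APIType = "locational"
)

// listFunc stands in for a provider's GetInstanceTypes, scoped by location names.
type listFunc func(locations []string) ([]string, error)

// syncInstanceTypes makes one call for a global API and one call per
// location for a locational API.
func syncInstanceTypes(apiType APIType, locations []string, list listFunc) ([]string, error) {
	if apiType == APITypeGlobal {
		return list([]string{"all"})
	}
	var all []string
	for _, loc := range locations {
		types, err := list([]string{loc})
		if err != nil {
			return nil, fmt.Errorf("listing instance types in %s: %w", loc, err)
		}
		all = append(all, types...)
	}
	return all, nil
}

func main() {
	calls := 0
	list := func(locations []string) ([]string, error) {
		calls++
		return []string{locations[0] + "/gpu_1x_a100"}, nil
	}
	types, err := syncInstanceTypes(APITypeLocational, []string{"us-west-1", "eu-north-1"}, list)
	fmt.Println(calls, types, err)
}
```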

3. Instance Types: Your Compute Catalog

What Is an Instance Type to Brev?

Brev treats compute as inventory. Each instance type represents a distinct compute configuration in your catalog. Users browse your instance types filtered by GPU, region, price, and availability.

The Canonical Instance Type Model

When we ingest your instance types, we normalize them to the v1.InstanceType struct. Here are the key fields (see cloud/v1/instancetype.go for the complete definition):

Field	Type	Description
`ID`	`InstanceTypeID`	Stable, unique identifier (you define the format—see below)
`Cloud`	`string`	Your cloud identifier (e.g., `"lambdalabs"`, `"crusoe"`)
`Provider`	`string`	Provider identifier (often same as `Cloud`)
`Type`	`string`	Your native type name
`Location`	`string`	Primary region identifier
`SubLocation`	`string`	Availability zone (optional; helper uses `"noSub"` if empty)
`AvailableAzs`	`[]string`	All zones where this type is available
`SupportedGPUs`	`[]GPU`	GPU details (see `GPU` struct below)
`VCPU`	`int32`	vCPU count
`MemoryBytes`	`Bytes`	RAM (use `v1.NewBytes()` helper)
`SupportedStorage`	`[]Storage`	Storage options (see `Storage` struct)
`BasePrice`	`*currency.Amount`	Hourly price in USD
`IsAvailable`	`bool`	Currently launchable
`Stoppable`	`bool`	Can instances be stopped/resumed
`Rebootable`	`bool`	Can instances be rebooted

The GPU struct (cloud/v1/instancetype.go):

Field	Type	Description
`Count`	`int32`	Number of GPUs
`Name`	`string`	GPU model name (e.g., `"A100"`, `"H100"`)
`Type`	`string`	Full GPU type (e.g., `"A100.SXM4"`)
`MemoryBytes`	`Bytes`	VRAM per GPU
`MemoryDetails`	`string`	Memory type: `"HBM"`, `"GDDR"`, etc.
`NetworkDetails`	`string`	Interconnect: `"PCIe"`, `"SXM4"`, `"SXM5"`
`Manufacturer`	`Manufacturer`	`ManufacturerNVIDIA`, `ManufacturerIntel`, etc.

The Storage struct (cloud/v1/storage.go):

Field	Type	Description
`Count`	`int32`	Number of disks
`SizeBytes`	`Bytes`	Disk size
`Type`	`string`	Storage type (e.g., `"ssd"`, `"nvme"`)
`PricePerGBHr`	`*currency.Amount`	Additional storage cost (if applicable)
`IsEphemeral`	`bool`	Lost on stop/terminate

Instance Type ID

The ID field must be a stable, unique identifier for each instance type across all regions. You control the format.

Requirements:

Stable: The same instance type must return the same ID on every sync
Unique: No two instance types can share an ID
Deterministic: IDs must not change between API calls

Option 1: Use the Helper Function

The SDK provides MakeGenericInstanceTypeID() which generates IDs using this pattern:

{location}-{subLocation}-{type}

The helper takes {subLocation} from the first entry of AvailableAzs. If your instance type has no zone, the helper uses "noSub" as a placeholder.

// Set all fields first, then call the helper at the END
it := v1.InstanceType{
    Location: "us-west-1",
    Type:     "gpu_1x_a100",
    // ... other fields
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // Result: "us-west-1-noSub-gpu_1x_a100"

Option 2: Define Your Own Format

If you prefer a different ID format, set ID directly:

// Shadeform uses: {cloud}_{instanceType}_{region}
it := v1.InstanceType{
    ID:       v1.InstanceTypeID("massedcompute_L40_desmoines-usa-1"),
    Location: "desmoines-usa-1",
    Type:     "massedcompute_L40",
    // ... other fields
}

Why Stability Matters:

Brev uses this ID to track inventory and match provisioning requests. If your IDs change between syncs, Brev loses the ability to correlate instance types correctly.

CRITICAL: ID Consistency Between InstanceType and Instance

Warning: This is the most common cause of integration failures. Instance types may sync successfully but instances fail to provision or appear "orphaned."

When Brev provisions an instance, it looks up the corresponding instance type using the instance's InstanceTypeID. These IDs must match exactly.

A Common Problem:

The SDK has two helper functions that generate IDs differently:

Function	Used For	SubLocation Source
`MakeGenericInstanceTypeID()`	InstanceType structs	`AvailableAzs[0]` (first AZ)
`MakeGenericInstanceTypeIDFromInstance()`	Instance structs	`SubLocation` field

If AvailableAzs[0] and SubLocation don't match, the IDs diverge and lookup fails.
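
The divergence is easier to see with both helpers side by side. The sketch below re-implements the documented pattern locally so it runs on its own; it is not the SDK source, and the struct types and function names (typeID, typeIDFromInstance) are hypothetical stand-ins that assume the {location}-{subLocation}-{type} format and the "noSub" placeholder described above.

```go
package main

import "fmt"

const noSub = "noSub" // placeholder used when no zone is known

// InstanceType and Instance are trimmed-down, hypothetical stand-ins
// for the v1 structs.
type InstanceType struct {
	Location, Type string
	AvailableAzs   []string
}

type Instance struct {
	Location, SubLocation, InstanceType string
}

// typeID reads the zone from AvailableAzs[0], as the table above describes
// for MakeGenericInstanceTypeID.
func typeID(it InstanceType) string {
	sub := noSub
	if len(it.AvailableAzs) > 0 {
		sub = it.AvailableAzs[0]
	}
	return fmt.Sprintf("%s-%s-%s", it.Location, sub, it.Type)
}

// typeIDFromInstance reads the zone from SubLocation, as the table above
// describes for MakeGenericInstanceTypeIDFromInstance.
func typeIDFromInstance(inst Instance) string {
	sub := noSub
	if inst.SubLocation != "" {
		sub = inst.SubLocation
	}
	return fmt.Sprintf("%s-%s-%s", inst.Location, sub, inst.InstanceType)
}

func main() {
	it := InstanceType{Location: "us-east-1", Type: "gpu-h100-8x", AvailableAzs: []string{"us-east-1b"}}
	inst := Instance{Location: "us-east-1", SubLocation: "us-east-1a", InstanceType: "gpu-h100-8x"}

	fmt.Println(typeID(it))               // us-east-1-us-east-1b-gpu-h100-8x
	fmt.Println(typeIDFromInstance(inst)) // us-east-1-us-east-1a-gpu-h100-8x
	fmt.Println(typeID(it) == typeIDFromInstance(inst))
}
```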

The Mistakes:

// WRONG - Manually setting InstanceTypeID
inst := &v1.Instance{
    InstanceType:   "gpu-h100-8x",
    InstanceTypeID: v1.InstanceTypeID("gpu-h100-8x"),  // BUG: Missing location!
}

// WRONG - Inconsistent SubLocation vs AvailableAzs
instanceType := v1.InstanceType{
    Location:     "us-east-1",
    SubLocation:  "us-east-1a",      // Set to "us-east-1a"
    AvailableAzs: []string{"us-east-1b"},  // But AZs has "us-east-1b"!
}

The Fix:

For InstanceType: Set all fields first, then call MakeGenericInstanceTypeID() at the END
For Instance: Set all fields first, then call MakeGenericInstanceTypeIDFromInstance() at the END
Ensure consistency: If you set both SubLocation and AvailableAzs, make sure SubLocation == AvailableAzs[0]

// CORRECT - InstanceType
it := v1.InstanceType{
    Location:     "us-east-1",
    AvailableAzs: []string{"us-east-1a"},
    Type:         "gpu-h100-8x",
    // ... other fields
}
it.ID = v1.MakeGenericInstanceTypeID(it)  // LAST

// CORRECT - Instance
inst := &v1.Instance{
    Location:     "us-east-1",
    SubLocation:  "us-east-1a",  // Matches the AZ
    InstanceType: "gpu-h100-8x",
    // ... other fields
}
inst.InstanceTypeID = v1.MakeGenericInstanceTypeIDFromInstance(*inst)  // LAST

Symptoms of ID Mismatch:

Instance types sync successfully but don't appear in the Brev catalog
CreateInstance succeeds but subsequent operations fail
"instance type not found" errors during provisioning
Instances appear "orphaned" (no associated instance type)

Validating Your Instance Type IDs

The SDK provides validation functions to catch ID generation issues early. Run these in your test suite:

1. ValidateStableInstanceTypeIDs - Ensures your instance type IDs are stable and unique:

// In your validation tests
err := v1.ValidateStableInstanceTypeIDs(ctx, client, stableIDs)
require.NoError(t, err, "ValidateStableInstanceTypeIDs should pass")

This validates:

Each instance type ID is unique (no duplicates)
Your designated stable IDs exist in the current instance types
All instance types have required properties (base price, storage pricing)
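
For intuition, the first two checks amount to something like the sketch below. This is not the SDK's implementation of ValidateStableInstanceTypeIDs; checkInstanceTypeIDs is a hypothetical helper that works on plain ID strings and leaves out the pricing checks.

```go
package main

import "fmt"

// checkInstanceTypeIDs reports the first duplicate ID in currentIDs, or the
// first designated stable ID that is missing from currentIDs.
func checkInstanceTypeIDs(currentIDs, stableIDs []string) error {
	seen := make(map[string]bool, len(currentIDs))
	for _, id := range currentIDs {
		if seen[id] {
			return fmt.Errorf("duplicate instance type ID %q", id)
		}
		seen[id] = true
	}
	for _, id := range stableIDs {
		if !seen[id] {
			return fmt.Errorf("stable ID %q missing from current instance types", id)
		}
	}
	return nil
}

func main() {
	current := []string{"us-west-1-noSub-gpu_1x_a100", "us-east-1-noSub-gpu_1x_a100"}
	fmt.Println(checkInstanceTypeIDs(current, []string{"us-west-1-noSub-gpu_1x_a100"}))
	fmt.Println(checkInstanceTypeIDs(current, []string{"eu-north-1-noSub-gpu_1x_a100"}))
}
```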

2. ValidateCreateInstance - Validates that instance and instance type IDs match:

// In your validation tests
instance, err := v1.ValidateCreateInstance(ctx, client, attrs, selectedType)
require.NoError(t, err, "ValidateCreateInstance should pass")

This validates (among other things):

instance.InstanceTypeID == selectedType.ID — catches ID generation mismatches
instance.RefID matches the provided RefID
Location and instance type fields are consistent

Why this matters: If MakeGenericInstanceTypeID() and MakeGenericInstanceTypeIDFromInstance() produce different IDs for the same logical type, the control plane cannot correlate instances with their types. ValidateCreateInstance catches this.

See internal/validation/suite.go for the full validation test suite you can use as a reference.

4. Location Model

The Location Hierarchy

Brev describes where compute resources exist using a region plus availability zones, expressed through three fields:

Level	Field	Description	Example
Region	`Location`	Primary geographic region	`"us-west-1"`, `"europe-west4"`
Availability Zone	`SubLocation`	Specific zone within a region	`"us-west-1a"`, `"europe-west4-b"`
Available Zones	`AvailableAzs`	All zones where this type can launch	`["us-west-1a", "us-west-1b"]`

Note: The distinction between these fields can be confusing. Location is the region, SubLocation is a specific zone (used for instances), and AvailableAzs lists all zones where an instance type is available (used for instance types).

The Location Struct

When implementing GetLocations(), you return a list of Location structs (defined in cloud/v1/location.go):

Field	Type	Description
`Name`	`string`	Region identifier (acts as the ID)
`Description`	`string`	Human-readable name
`Available`	`bool`	Whether the region is currently operational
`Endpoint`	`string`	API endpoint for this region (if applicable)
`Priority`	`int`	Preference order for region selection
`Country`	`string`	ISO 3166-1 alpha-3 country code

Availability on Instance Types

Availability is tracked per instance type using two fields on the InstanceType struct:

Field	Type	Meaning
`IsAvailable`	`bool`	Whether this type can currently be launched
`AvailableAzs`	`[]string`	Which availability zones have capacity

Interpreting Availability:

IsAvailable: true + AvailableAzs: ["us-west-1a", "us-west-1b"] = Can launch in either AZ
IsAvailable: false = Type exists but is currently out of stock or disabled
Empty AvailableAzs with IsAvailable: true = Region-level availability only (no AZ granularity)
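
The three cases above reduce to a short decision function. The sketch below is illustrative only: launchTargets is a hypothetical helper, and InstanceType is a local stand-in carrying just the fields involved.

```go
package main

import "fmt"

// InstanceType is a trimmed-down, hypothetical stand-in for v1.InstanceType.
type InstanceType struct {
	Location     string
	IsAvailable  bool
	AvailableAzs []string
}

// launchTargets returns where the type can launch right now: nil when it is
// unavailable, the listed zones when there are any, otherwise the region itself.
func launchTargets(it InstanceType) []string {
	if !it.IsAvailable {
		return nil
	}
	if len(it.AvailableAzs) == 0 {
		return []string{it.Location} // region-level availability only
	}
	return it.AvailableAzs
}

func main() {
	fmt.Println(launchTargets(InstanceType{Location: "us-west-1", IsAvailable: true, AvailableAzs: []string{"us-west-1a", "us-west-1b"}}))
	fmt.Println(launchTargets(InstanceType{Location: "us-west-1", IsAvailable: true}))
	fmt.Println(launchTargets(InstanceType{Location: "us-west-1", IsAvailable: false, AvailableAzs: []string{"us-west-1a"}}))
}
```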

5. GPU Normalization

The GPU Struct

The Cloud SDK represents GPUs with these fields:

type GPU struct {
    Name           string           // Base model: "H100", "A100", "L40S"
    Count          int32            // Number of GPUs
    Memory         units.Base2Bytes // VRAM per GPU (deprecated, use MemoryBytes)
    MemoryBytes    Bytes            // VRAM per GPU in structured format
    MemoryDetails  string           // Memory type: "HBM2", "HBM3", "HBM2e", "GDDR"
    NetworkDetails string           // Form factor: "PCIe", "SXM", "SXM4", "SXM5"
    Manufacturer   Manufacturer     // "NVIDIA", "AMD", "Intel"
    Type           string           // Optional: original type identifier
}

Implementer Responsibility

You are responsible for normalizing GPU data. Brev does not automatically parse GPU descriptions. Your GetInstanceTypes must populate the GPU struct.

Field	Example	Notes
`Name`	`"H100"`, `"A100"`	Base model, uppercase
`Count`	`8`	GPUs per instance
`MemoryBytes`	`v1.NewBytes(80, v1.Gibibyte)`	VRAM per GPU
`NetworkDetails`	`"SXM4"`, `"PCIe"`	Form factor
`Manufacturer`	`"NVIDIA"`	GPU vendor

Provider Examples

Lambda Labs (cloud/v1/providers/lambdalabs/instancetype.go:parseGPUFromDescription)

Parses "8x A100 (40 GB SXM4)" using regex:

gpu.Count = int32(count)           // from (\d+)x
gpu.Name = nameStr                 // from x (.*?) \(
gpu.MemoryBytes = v1.NewBytes(v1.BytesValue(memoryGiB), v1.Gibibyte)
gpu.NetworkDetails = networkDetails // remainder after "GB"
gpu.Manufacturer = "NVIDIA"
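
A self-contained version of this kind of parsing might look like the sketch below. The regular expression, the local GPU struct, and parseGPU are assumptions written for this guide, not the code in the Lambda Labs provider; the sketch follows the snippet above in treating the advertised "GB" figure as GiB.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// GPU is a trimmed-down, hypothetical stand-in for v1.GPU.
type GPU struct {
	Count          int32
	Name           string
	MemoryGiB      int64
	NetworkDetails string
	Manufacturer   string
}

// gpuDescription matches strings such as "8x A100 (40 GB SXM4)":
// count, model name, memory size, then whatever follows "GB".
var gpuDescription = regexp.MustCompile(`^(\d+)x (.*?) \((\d+) GB ?(.*)\)$`)

// parseGPU extracts GPU details from a provider description string.
func parseGPU(description string) (GPU, error) {
	m := gpuDescription.FindStringSubmatch(strings.TrimSpace(description))
	if m == nil {
		return GPU{}, fmt.Errorf("unrecognized GPU description %q", description)
	}
	count, err := strconv.ParseInt(m[1], 10, 32)
	if err != nil {
		return GPU{}, fmt.Errorf("parsing GPU count: %w", err)
	}
	memory, err := strconv.ParseInt(m[3], 10, 64)
	if err != nil {
		return GPU{}, fmt.Errorf("parsing GPU memory: %w", err)
	}
	return GPU{
		Count:          int32(count),
		Name:           strings.ToUpper(m[2]),
		MemoryGiB:      memory,
		NetworkDetails: strings.TrimSpace(m[4]),
		Manufacturer:   "NVIDIA", // assumption: this provider only lists NVIDIA GPUs
	}, nil
}

func main() {
	gpu, err := parseGPU("8x A100 (40 GB SXM4)")
	fmt.Printf("%+v %v\n", gpu, err)
}
```

Returning an error for descriptions that do not match, rather than a zero-value GPU, keeps malformed catalog entries from reaching Brev silently.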

Launchpad (cloud/v1/providers/launchpad/instancetype.go:launchpadGpusToGpus)

Maps structured API fields:

gpus[i] = v1.GPU{
    Name:           strings.ToUpper(gp.Family),
    Count:          gp.Count,
    MemoryBytes:    v1.NewBytes(v1.BytesValue(gp.MemoryGb), v1.Gigabyte),
    NetworkDetails: string(gp.InterconnectionType),
    Manufacturer:   v1.GetManufacturer(gp.Manufacturer),
}

Key Points

Name: base model only ("H100" not "NVIDIA H100 80GB")
NetworkDetails: "SXM", "SXM4", "SXM5", or "PCIe"
Manufacturer: always set it, using the constant that matches the vendor (e.g., v1.ManufacturerNVIDIA for NVIDIA GPUs)

6. Credential and Authentication Model

How Brev Authenticates to Your API

Brev stores credentials for your cloud provider and uses them to make API calls. This is a direct relationship between Brev's control plane and your cloud API.

What You Need to Provide

Requirement	Details
API Credentials	A JSON-serializable Go struct containing your authentication fields (API key, token, service account, etc.)
Authentication Method	How Brev authenticates (API key header, OAuth, etc.)

Credential Storage Model

Credentials are stored in Brev's control plane database as raw JSON (json.RawMessage). This means your credential struct must be JSON-serializable with proper struct tags.

How it works:

You define a credential struct with JSON tags for each field
Brev stores the struct as raw JSON bytes in the database (encrypted at rest)
Brev deserializes the JSON back into your struct type when making API calls

Example credential struct:

type MyProviderCredential struct {
    RefID  string            // Set by Brev (the cloud_cred ID)
    APIKey string `json:"api_key"`
    Region string `json:"region,omitempty"`  // Optional fields use omitempty
}

Key requirements:

All fields you need serialized must have json:"field_name" tags
The RefID field is set by Brev after storage (it's the database record ID)
Use json:"...,omitempty" for optional fields
The struct must implement the CloudCredential interface
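
The storage round trip can be sketched with the standard library alone. The store and load helpers below are hypothetical and only illustrate how the JSON tags shape what is persisted; Brev's actual storage and encryption are not shown. One assumption to flag: this sketch tags RefID with json:"-" so the record ID stays out of the stored JSON, because encoding/json would otherwise marshal an untagged exported field under its Go name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MyProviderCredential follows the example struct above, with RefID
// explicitly excluded from serialization (an assumption of this sketch).
type MyProviderCredential struct {
	RefID  string `json:"-"`
	APIKey string `json:"api_key"`
	Region string `json:"region,omitempty"`
}

// store serializes the credential to the raw JSON that would be persisted.
func store(cred MyProviderCredential) (json.RawMessage, error) {
	raw, err := json.Marshal(cred)
	if err != nil {
		return nil, fmt.Errorf("encoding credential: %w", err)
	}
	return raw, nil
}

// load rebuilds the credential from stored JSON and attaches the record ID.
func load(refID string, raw json.RawMessage) (MyProviderCredential, error) {
	var cred MyProviderCredential
	if err := json.Unmarshal(raw, &cred); err != nil {
		return MyProviderCredential{}, fmt.Errorf("decoding credential: %w", err)
	}
	cred.RefID = refID
	return cred, nil
}

func main() {
	raw, err := store(MyProviderCredential{APIKey: "secret"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // {"api_key":"secret"}

	cred, err := load("cred-123", raw)
	fmt.Printf("%+v %v\n", cred, err)
}
```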

Credential Exchange Process

You provide API credentials to Brev during integration setup
Brev stores credentials securely (encrypted at rest)
Brev uses credentials to call your API for sync and provisioning

Credential Types

Providers define their own credential struct with whatever fields they need. The struct fields use JSON tags that determine the field names in the stored JSON.

Provider	Struct Fields	JSON Fields
Lambda Labs	`APIKey string`	`api_key`
Shadeform	`APIKey string`	`api_key`
FluidStack	`APIKey string`	`api_key`
AWS	`AccessKeyID`, `SecretAccessKey`	`access_key_id`, `secret_access_key`
Nebius	`ServiceAccountKey`, `TenantID`	`service_account_key`, `tenant_id`
Launchpad	`APIToken`, `APIURL`	`api_token`, `api_url`

Complete credential struct example (from Launchpad):

type LaunchpadCredential struct {
    RefID    string            // Set by Brev after storage; without a `json:"-"` tag, encoding/json still emits this field as "RefID"
    APIToken string `json:"api_token"`
    APIURL   string `json:"api_url"`
}

var _ v1.CloudCredential = &LaunchpadCredential{}  // Compile-time interface check

func (c *LaunchpadCredential) Validate() error {
    return validation.ValidateStruct(c,
        validation.Field(&c.APIToken, validation.Required),
        validation.Field(&c.APIURL, validation.Required),
    )
}

Your credential struct must implement the CloudCredential interface, which requires these methods:

type CloudCredential interface {
    MakeClient(ctx context.Context, location string) (CloudClient, error)
    GetTenantID() (string, error)
    GetReferenceID() string
    GetAPIType() APIType
    GetCapabilities(ctx context.Context) (Capabilities, error)
    GetCloudProviderID() CloudProviderID
}

SSH Keys (Separate from API Credentials)

SSH keys are passed at instance creation time via the PublicKey field in CreateInstanceAttrs.

Your implementation must:

Accept this public key in your create instance API
Install it in the default user's ~/.ssh/authorized_keys on the VM before the instance becomes accessible

Brev manages SSH keys per user. The public key provided in CreateInstanceAttrs.PublicKey belongs to the user, and the control plane retains the corresponding private key to connect after creation.

7. Instance Lifecycle Operations

This section describes each lifecycle operation, its requirements, and expected behavior. Not all operations are required—providers declare their capabilities via GetCapabilities().

Lifecycle States

The SDK defines these states in LifecycleStatus (from cloud/v1/instance.go):

State	Meaning
`pending`	Create initiated, VM provisioning
`running`	Instance is up with a public IP
`stopping`	Stop requested, shutting down
`stopped`	Powered off, storage preserved
`suspending`	Suspend requested
`suspended`	Hibernated state
`terminating`	Terminate requested
`terminated`	Instance destroyed
`failed`	Provisioning or operation failed
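
Most providers report their own status strings, so an implementation usually needs a translation step into these states. The sketch below shows one way to do it. The provider-side strings ("booting", "active", and so on) and the Go constant names are invented for illustration; only the lowercase state values come from the table above.

```go
package main

import "fmt"

// LifecycleStatus uses the state values from the table above; the constant
// names are hypothetical.
type LifecycleStatus string

const (
	StatusPending     LifecycleStatus = "pending"
	StatusRunning     LifecycleStatus = "running"
	StatusStopping    LifecycleStatus = "stopping"
	StatusStopped     LifecycleStatus = "stopped"
	StatusTerminating LifecycleStatus = "terminating"
	StatusTerminated  LifecycleStatus = "terminated"
	StatusFailed      LifecycleStatus = "failed"
)

// nativeToLifecycle maps an imaginary provider's status strings to Brev states.
var nativeToLifecycle = map[string]LifecycleStatus{
	"booting":       StatusPending,
	"active":        StatusRunning,
	"shutting_down": StatusStopping,
	"off":           StatusStopped,
	"deleting":      StatusTerminating,
	"deleted":       StatusTerminated,
	"error":         StatusFailed,
}

// toLifecycleStatus translates a native status and rejects unknown values
// instead of guessing a state.
func toLifecycleStatus(native string) (LifecycleStatus, error) {
	status, ok := nativeToLifecycle[native]
	if !ok {
		return "", fmt.Errorf("unmapped provider status %q", native)
	}
	return status, nil
}

func main() {
	for _, native := range []string{"booting", "active", "rebuilding"} {
		status, err := toLifecycleStatus(native)
		fmt.Println(native, status, err)
	}
}
```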

Create Instance (Required)

Interface: CloudCreateTerminateInstance.CreateInstance(ctx, CreateInstanceAttrs) (*Instance, error)

Contract:

On success: Return an *Instance with a valid CloudID. The instance must exist in your system.
On error: Return an error and ensure no instance was created. Brev will not attempt cleanup on errors, so remove any partially created resources before returning.
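
One way to honor this contract is to roll back inside CreateInstance whenever a later step fails. The sketch below uses an in-memory fakeCloud in place of a real provider API; every name in it is hypothetical, and the RefID lookup shows one possible reading of "use for idempotency".

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCloud is an in-memory stand-in for a provider API (hypothetical).
type fakeCloud struct {
	vms        map[string]string // cloud ID -> RefID
	failAttach bool              // simulate a failure after the VM exists
}

func (f *fakeCloud) launch(refID string) string {
	id := "vm-" + refID
	f.vms[id] = refID
	return id
}

func (f *fakeCloud) attachKey(id string) error {
	if f.failAttach {
		return errors.New("key injection failed")
	}
	return nil
}

func (f *fakeCloud) destroy(id string) { delete(f.vms, id) }

// createInstance returns a cloud ID on success. On failure it removes the
// half-created VM so that an error always means nothing was created.
func createInstance(c *fakeCloud, refID string) (string, error) {
	for id, ref := range c.vms {
		if ref == refID {
			return id, nil // already created for this RefID
		}
	}
	id := c.launch(refID)
	if err := c.attachKey(id); err != nil {
		c.destroy(id)
		return "", fmt.Errorf("creating instance for %s: %w", refID, err)
	}
	return id, nil
}

func main() {
	cloud := &fakeCloud{vms: map[string]string{}, failAttach: true}
	_, err := createInstance(cloud, "ref-1")
	fmt.Println(err, len(cloud.vms))

	cloud.failAttach = false
	id, err := createInstance(cloud, "ref-1")
	fmt.Println(id, err, len(cloud.vms))
}
```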

Key input fields from CreateInstanceAttrs:

Field	Type	Required	Description
`RefID`	`string`	Yes	Brev's reference ID; use for idempotency
`InstanceType`	`string`	Yes	Your instance type name
`Location`	`string`	Yes	Region to launch in
`SubLocation`	`string`	No	Specific availability zone
`PublicKey`	`string`	Yes	SSH public key (OpenSSH format)
`Name`	`string`	No	Display name for the instance
`ImageID`	`string`	No	OS image; use your default if empty
`DiskSize`	`units.Base2Bytes`	No	Boot disk size
`FirewallRules`	`FirewallRules`	No	Ports to open (SSH port is always required)
`Tags`	`Tags`	No	Key-value metadata
`UserDataBase64`	`string`	No	Cloud-init or startup script

Key output fields on Instance:

Field	When Required	Description
`CloudID`	Always	Your unique instance identifier
`Status.LifecycleStatus`	Always	Current state (`pending` or `running`)
`Location`	Always	Region where launched
`InstanceType`	Always	Instance type that was provisioned
`PublicIP`	When running	Public IPv4 for SSH access
`SSHUser`	Always	Username for SSH (e.g., `ubuntu`, `root`)
`SSHPort`	Always	SSH port (typically `22`)
`RefID`	Always	Echo back the input `RefID`

Example flow (from Lambda Labs implementation):

// 1. Register the SSH key with your API, named after Brev's RefID
keyPairName := attrs.RefID
if _, err := c.addSSHKey(ctx, openapi.AddSSHKeyRequest{
    Name:      keyPairName,
    PublicKey: &attrs.PublicKey,
}); err != nil {
    return nil, err
}

// 2. Launch the instance with the key
resp, err := c.launchInstance(ctx, openapi.LaunchInstanceRequest{
    RegionName:       attrs.Location,
    InstanceTypeName: attrs.InstanceType,
    SshKeyNames:      []string{keyPairName},
})
if err != nil {
    return nil, err
}

// 3. Return instance details
return c.GetInstance(ctx, v1.CloudProviderInstanceID(resp.Data.InstanceIds[0]))

Terminate Instance (Required)

Interface: CloudCreateTerminateInstance.TerminateInstance(ctx, instanceID) error

Contract:

Initiate instance termination. Storage may or may not be preserved (provider-dependent).
Return nil on success, even if the instance is already terminated.
The instance should eventually reach terminated state.

Idempotency: Should succeed if called multiple times on the same instance.

Stop Instance (Optional)

Capability: CapabilityStopStartInstance

Interface: CloudStopStartInstance.StopInstance(ctx, instanceID) error

Contract:

Power off the instance while preserving storage.
Return nil once the stop operation is initiated.
Instance should transition: running → stopping → stopped

When to implement: Only if your platform supports instances that can stop and preserve storage. Lambda Labs does not support this, but Nebius does.

Start Instance (Optional)

Capability: CapabilityStopStartInstance

Interface: CloudStopStartInstance.StartInstance(ctx, instanceID) error

Contract:

Power on a previously stopped instance.
Return nil once the start operation is initiated.
Instance should transition: stopped → pending → running

Note: If you implement StopInstance, you must also implement StartInstance.

Stop/Start: Three Levels of Control

Stop/start support is controlled at three levels:

Level	What to Set	Purpose
Provider Capability	`CapabilityStopStartInstance` in `GetCapabilities()`	Indicates your API supports stop/start operations
Instance Type	`InstanceType.Stoppable = true/false`	Indicates whether this instance type can be stopped (e.g., spot instances typically cannot)
Instance	`Instance.Stoppable = true/false`	Indicates whether this specific instance can be stopped

Example - Nebius (supports stop/start):

// In GetCapabilities()
v1.CapabilityStopStartInstance,  // API supports it

// In GetInstanceTypes() - instance type level
instanceType := v1.InstanceType{
    Stoppable: true,  // This type supports stop/start
    // ...
}

// In GetInstance()/CreateInstance() - instance level
instance := v1.Instance{
    Stoppable: true,  // This instance can be stopped
    // ...
}

Example - Lambda Labs (no stop/start support):

// In GetCapabilities()
// CapabilityStopStartInstance NOT included

// In GetInstanceTypes()
instanceType := v1.InstanceType{
    Stoppable: false,  // Cannot be stopped
    // ...
}

// In GetInstance()/CreateInstance()
instance := v1.Instance{
    Stoppable: false,  // Cannot be stopped
    // ...
}

The control plane checks all three levels before allowing a stop/start operation. If any level indicates false, the operation won't be attempted.
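
Expressed as code, the check is a simple conjunction. The sketch below is illustrative rather than the control plane's logic: canStopStart is a hypothetical function, and the capability is modeled as a plain string whose value is invented here.

```go
package main

import "fmt"

// capabilityStopStart is a hypothetical string value standing in for
// v1.CapabilityStopStartInstance.
const capabilityStopStart = "stop-start-instance"

// canStopStart returns true only when the provider, the instance type, and
// the instance all allow stop/start.
func canStopStart(capabilities []string, typeStoppable, instanceStoppable bool) bool {
	providerSupports := false
	for _, c := range capabilities {
		if c == capabilityStopStart {
			providerSupports = true
			break
		}
	}
	return providerSupports && typeStoppable && instanceStoppable
}

func main() {
	withStopStart := []string{"create-terminate-instance", capabilityStopStart}
	fmt.Println(canStopStart(withStopStart, true, true))  // all three levels agree
	fmt.Println(canStopStart(withStopStart, false, true)) // instance type opts out
	fmt.Println(canStopStart(nil, true, true))            // provider lacks the capability
}
```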

Get Instance (Required)

Interface: CloudInstanceReader.GetInstance(ctx, instanceID) (*Instance, error)

Contract:

Return current state of the instance.
Return ErrResourceNotFound if the instance doesn't exist.

List Instances (Required)

Interface: CloudInstanceReader.ListInstances(ctx, ListInstancesArgs) ([]Instance, error)

Contract:

Return all instances matching the filter criteria.
Used by the Instance Syncer to reconcile state (called every ~5 seconds).

Capability Declaration

Your credential's GetCapabilities() must return the capabilities you support:

func (c *MyCredential) GetCapabilities(ctx context.Context) (v1.Capabilities, error) {
    return v1.Capabilities{
        v1.CapabilityCreateInstance,           // Required
        v1.CapabilityTerminateInstance,        // Required
        v1.CapabilityCreateTerminateInstance,  // Required (composite)
        // Optional:
        v1.CapabilityStopStartInstance,        // If you support stop/start
        v1.CapabilityRebootInstance,           // If you support reboot
        v1.CapabilityTags,                     // If your API supports instance tags/labels (see Section 10)
        v1.CapabilityModifyFirewall,           // If you support dynamic firewall rules
        v1.CapabilityResizeInstanceVolume,     // If you support volume resizing
    }, nil
}

Brev checks capabilities before calling optional methods. If you don't declare a capability, Brev won't attempt that operation.
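
As a rough sketch of that gating behavior, the example below skips an optional operation when its capability was not declared. The `Has` and `callIfSupported` helpers, the error value, and the capability string values are hypothetical and exist only for illustration; the SDK's real capability-check API may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the SDK's capability types.
type Capability string

const (
	CapabilityCreateInstance Capability = "create-instance"
	CapabilityRebootInstance Capability = "reboot-instance"
)

type Capabilities []Capability

// Has reports whether the capability was declared (hypothetical helper).
func (c Capabilities) Has(want Capability) bool {
	for _, have := range c {
		if have == want {
			return true
		}
	}
	return false
}

var errCapabilityNotDeclared = errors.New("capability not declared")

// callIfSupported runs op only when the provider declared the capability.
func callIfSupported(caps Capabilities, need Capability, op func() error) error {
	if !caps.Has(need) {
		return fmt.Errorf("%w: %s", errCapabilityNotDeclared, need)
	}
	return op()
}

func main() {
	caps := Capabilities{CapabilityCreateInstance}
	err := callIfSupported(caps, CapabilityRebootInstance, func() error {
		fmt.Println("rebooting") // never reached: reboot was not declared
		return nil
	})
	fmt.Println(errors.Is(err, errCapabilityNotDeclared))
}
```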

Note on CapabilityTags: This capability is optional, but RefID and CloudCredRefID data is required regardless. If your API doesn't support tags, you must use an alternative mechanism to store and retrieve this data. See Section 10: Instance Metadata and Tags for details and examples.

8. SSH Connectivity

Core Requirement

Brev's control plane must be able to connect to your instances via SSH using the provided keys. This is the only hard requirement for network connectivity.

After your VM is running, Brev connects via SSH to:

Configure the environment: Install Brev agent, set up development tools
Enable connections: Set up tunnels and connection paths for users
Manage instance: Execute commands, transfer files, health checks

What You Provide at Launch

When provisioning, Brev passes:

SSH public key: Key to install in authorized_keys (via CreateInstanceAttrs.PublicKey)
Firewall rules: Ports to open (see Section 9)

Instance Requirements

Your instances must return these fields so Brev can connect:

Field	Required	Description
`SSHUser`	Yes	Username for SSH (e.g., `ubuntu`, `root`, `ec2-user`)
`SSHPort`	Yes	SSH port (commonly `22`, but can be any port)
`PublicIP`	Yes	Publicly routable address for SSH connection

Note: While PublicIP is the required field, public routing via DNS also works in practice. The key requirement is that Brev can reach your instance over SSH.

SSH User

Brev connects as the default user your image provides:

Image	Default User
Ubuntu	`ubuntu`
Debian	`admin` or `debian`
Amazon Linux	`ec2-user`
Custom	Whatever you configure

Runtime Requirements

Requirement	Details
SSHD running	On the port specified by `Instance.SSHPort`
Port publicly reachable	No NAT or firewall blocking inbound SSH
Key installed	The public key from `CreateInstanceAttrs.PublicKey` in `authorized_keys`
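
Before handing an instance to Brev, you may want to confirm the first two runtime requirements yourself. The sketch below only checks that a TCP connection to the SSH port succeeds; it does not perform an SSH handshake or verify the installed key. The `sshPortReachable` helper is hypothetical and not part of the SDK.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// sshPortReachable reports whether a TCP connection to host:port can be
// opened within the timeout. It does not speak the SSH protocol.
func sshPortReachable(host string, port int, timeout time.Duration) bool {
	addr := net.JoinHostPort(host, strconv.Itoa(port))
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

func main() {
	// Demonstrate against a throwaway local listener. In practice, pass the
	// instance's PublicIP and SSHPort instead.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	port := ln.Addr().(*net.TCPAddr).Port
	fmt.Println(sshPortReachable("127.0.0.1", port, time.Second))
}
```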

9. Firewall and Security Groups

Can you dynamically expose ports at instance creation? Yes, if you support user-data or have a native firewall API.

Can you modify firewall rules after creation without SSH/reboot? Only if you have a native API. Most GPU clouds don't.

SDK Structures

type FirewallRules struct {
    IngressRules []FirewallRule
    EgressRules  []FirewallRule
}

type FirewallRule struct {
    FromPort int32
    ToPort   int32
    IPRanges []string // CIDR notation
}

Passed via CreateInstanceAttrs.FirewallRules.

If You Have a Native API

Use it. Implement CloudModifyFirewall for post-creation changes:

type CloudModifyFirewall interface {
    AddFirewallRulesToInstance(ctx context.Context, args AddFirewallRulesToInstanceArgs) error
    RevokeSecurityGroupRules(ctx context.Context, args RevokeSecurityGroupRuleArgs) error
}

Add CapabilityModifyFirewall to your capabilities.

If You Only Have User-Data

Inject UFW + iptables commands at boot. Reference implementation: cloud/v1/providers/shadeform/firewall.go.

UFW Commands - Host-level firewall:

// Core UFW pattern
commands := []string{
    "ufw --force reset",          // Reset to clean state
    "ufw default deny incoming",  // Default deny incoming
    "ufw default allow outgoing", // Default allow outgoing
    "ufw allow 22/tcp",           // Always allow SSH
    "ufw allow 2222/tcp",         // Allow alternate SSH port
}

// Add ingress rules
for _, rule := range firewallRules.IngressRules {
    if rule.FromPort == rule.ToPort {
        // Single port
        if len(rule.IPRanges) == 0 {
            commands = append(commands, fmt.Sprintf("ufw allow in from any to any port %d", rule.FromPort))
        } else {
            for _, cidr := range rule.IPRanges {
                commands = append(commands, fmt.Sprintf("ufw allow in from %s to any port %d", cidr, rule.FromPort))
            }
        }
    } else {
        // Port ranges require separate tcp/udp rules
        for _, proto := range []string{"tcp", "udp"} {
            portSpec := fmt.Sprintf("port %d:%d proto %s", rule.FromPort, rule.ToPort, proto)
            if len(rule.IPRanges) == 0 {
                commands = append(commands, fmt.Sprintf("ufw allow in from any to any %s", portSpec))
            } else {
                for _, cidr := range rule.IPRanges {
                    commands = append(commands, fmt.Sprintf("ufw allow in from %s to any %s", cidr, portSpec))
                }
            }
        }
    }
}

// Add egress rules (same pattern as ingress but with "ufw allow out to")
// ...

commands = append(commands, "ufw --force enable")
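
The egress rules elided above could be generated along these lines. This is a sketch that mirrors the ingress pattern with `ufw allow out to`, not a copy of the reference implementation; the `egressCommands` helper is hypothetical, and `FirewallRule` is redeclared locally so the example stands alone.

```go
package main

import "fmt"

// Local copy of the SDK structure shown earlier in this section.
type FirewallRule struct {
	FromPort int32
	ToPort   int32
	IPRanges []string // CIDR notation
}

// egressCommands builds the "ufw allow out" commands for a single rule.
// An empty IPRanges list means "any destination".
func egressCommands(rule FirewallRule) []string {
	dests := rule.IPRanges
	if len(dests) == 0 {
		dests = []string{"any"}
	}

	var cmds []string
	for _, dest := range dests {
		if rule.FromPort == rule.ToPort {
			// Single port
			cmds = append(cmds, fmt.Sprintf("ufw allow out to %s port %d", dest, rule.FromPort))
			continue
		}
		// Port ranges require separate tcp/udp rules
		for _, proto := range []string{"tcp", "udp"} {
			cmds = append(cmds, fmt.Sprintf("ufw allow out to %s port %d:%d proto %s",
				dest, rule.FromPort, rule.ToPort, proto))
		}
	}
	return cmds
}

func main() {
	rule := FirewallRule{FromPort: 8000, ToPort: 8100, IPRanges: []string{"10.0.0.0/8"}}
	for _, cmd := range egressCommands(rule) {
		fmt.Println(cmd)
	}
}
```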

IPTables Commands - Block Docker from bypassing UFW:

Docker manipulates iptables directly, bypassing UFW. The DOCKER-USER chain is the official hook point for custom rules that Docker respects.

// Required iptables commands to secure Docker containers
iptablesCommands := []string{
    "iptables -F DOCKER-USER",                                              // Reset chain
    "iptables -A DOCKER-USER -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT", // Allow responses
    "iptables -A DOCKER-USER -i lo -j ACCEPT",                              // Allow loopback
    "iptables -A DOCKER-USER -j DROP",                                      // Drop all other inbound
    "iptables -A DOCKER-USER -j RETURN",                                    // Required by Docker
}

Without these iptables rules, a Docker container listening on 0.0.0.0:PORT would be accessible from the internet even if UFW blocks that port.

Full script generation:

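// Combine UFW + iptables commands ("commands" is the UFW slice built above).
// Copy into a fresh slice so the append cannot overwrite the original.
allCommands := append(append([]string{}, commands...), iptablesCommands...)

// One command per line
script := strings.Join(allCommands, "\n") + "\n"

// Base64-encode the script for injection at boot
encoded := base64.StdEncoding.EncodeToString([]byte(script))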

Validation Tests:

The SDK validates firewall behavior with two tests in cloud/internal/validation/suite.go:

ValidateFirewallBlocksPort - Verifies UFW blocks non-allowed ports (tests port 9999 by default)
ValidateDockerFirewallBlocksPort - Verifies iptables DOCKER-USER chain blocks Docker container ports

These run as part of RunInstanceLifecycleValidation and RunFirewallValidation. See cloud/v1/networking_validation.go for the test implementations.

With the user-data approach, rules are applied once at boot, so do not implement CloudModifyFirewall. Return ErrNotImplemented from its methods.

If You Only Have IP Allowlists

See cloud/v1/providers/launchpad/instance_create.go. You can only restrict by source IP, not port. Extract /32s from the rules and pass to your API:

ips := []string{}
for _, rule := range firewallRules.IngressRules {
    for _, cidr := range rule.IPRanges {
        _, ipNet, err := net.ParseCIDR(cidr)
        if err != nil {
            continue // skip malformed CIDRs instead of dereferencing a nil ipNet
        }
        ones, bits := ipNet.Mask.Size()
        if ones == bits { // single-host ranges only (/32 for IPv4)
            ips = append(ips, ipNet.IP.String())
        }
    }
}

10. Instance Metadata and Tags

Brev uses metadata to track and correlate instances. The control plane requires certain data to be persisted with instances and retrievable later.

Required Instance Data

These values MUST be stored with the instance and returned in GetInstance/ListInstances:

Field	Purpose
`RefID`	Instance correlation and idempotency (passed in `CreateInstanceAttrs.RefID`)
`CloudCredRefID`	Identifies which credential created the instance (from `GetReferenceID()`)

The `CapabilityTags` Capability

If your cloud provider's API supports instance tagging/labeling, declare v1.CapabilityTags in your capabilities:

func (c *MyCredential) GetCapabilities(ctx context.Context) (v1.Capabilities, error) {
    return v1.Capabilities{
        v1.CapabilityCreateInstance,
        v1.CapabilityTerminateInstance,
        v1.CapabilityCreateTerminateInstance,  // Required (composite), see Section 7
        v1.CapabilityTags,  // Declare this if your API supports tags/labels
    }, nil
}

When CapabilityTags is declared:

Store RefID, CloudCredRefID, and any additional tags via CreateInstanceAttrs.Tags
The control plane will call UpdateInstanceTags() to add metadata after creation
ListInstances() should support filtering via TagFilters for efficient queries

Example (Shadeform with tags):

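// At creation - store RefID and CloudCredRefID as "key=value" tags
refIDTag := fmt.Sprintf("refID=%s", attrs.RefID)
cloudCredRefIDTag := fmt.Sprintf("cloudCredRefID=%s", c.GetReferenceID())
tags := []string{refIDTag, cloudCredRefIDTag}

// When reading back - split each "key=value" tag, then look up the values
parsed := map[string]string{}
for _, tag := range tags {
    if key, value, ok := strings.Cut(tag, "="); ok {
        parsed[key] = value
    }
}
refID := parsed["refID"]
cloudCredRefID := parsed["cloudCredRefID"]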
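
If your API cannot filter by tag on the server side, ListInstances() can apply the filter after fetching. The sketch below shows one possible matching rule. The shapes used here are assumptions: tags as a string map, and filters as a map from tag key to accepted values. Check the SDK's ListInstancesArgs for the real TagFilters type before relying on this.

```go
package main

import "fmt"

// Assumed shapes for illustration only; the SDK's types may differ.
type Tags map[string]string
type TagFilters map[string][]string

// matches reports whether tags satisfy every filter key with at least one
// of that key's accepted values. An empty filter set matches everything.
func matches(tags Tags, filters TagFilters) bool {
	for key, accepted := range filters {
		val, ok := tags[key]
		if !ok {
			return false
		}
		found := false
		for _, a := range accepted {
			if a == val {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	tags := Tags{"refID": "abc123", "cloudCredRefID": "cred-1"}
	fmt.Println(matches(tags, TagFilters{"refID": {"abc123", "def456"}}))
	fmt.Println(matches(tags, TagFilters{"cloudCredRefID": {"cred-2"}}))
}
```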

Alternative: When Tags Are NOT Supported

If your API doesn't support tags, you still must persist and return RefID and CloudCredRefID. Use creative alternatives:

Example (Lambda Labs without tags):

// At creation - encode CloudCredRefID in instance name
name := fmt.Sprintf("%s--%s", c.GetReferenceID(), time.Now().UTC().Format(timeFormat))
// Use RefID as the SSH key pair name
keyPairName := attrs.RefID

// When reading back - extract from name and SSH key
nameParts := strings.Split(instance.Name, "--")
cloudCredRefID := nameParts[0]
refID := instance.SshKeyNames[0]
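
If you encode metadata in the instance name, it helps to decode it defensively, since instances created outside Brev will not follow the naming scheme. The sketch below round-trips a CloudCredRefID through a name using the same `--` separator as the example above. The `encodeName` and `decodeCloudCredRefID` helpers are hypothetical, and the scheme assumes the CloudCredRefID itself never contains `--`.

```go
package main

import (
	"fmt"
	"strings"
)

const nameSeparator = "--"

// encodeName prefixes the instance name with the CloudCredRefID.
func encodeName(cloudCredRefID, suffix string) string {
	return cloudCredRefID + nameSeparator + suffix
}

// decodeCloudCredRefID extracts the prefix, and returns an error for names
// that do not follow the scheme instead of returning a wrong ID.
func decodeCloudCredRefID(name string) (string, error) {
	prefix, _, found := strings.Cut(name, nameSeparator)
	if !found || prefix == "" {
		return "", fmt.Errorf("instance name %q does not carry a CloudCredRefID prefix", name)
	}
	return prefix, nil
}

func main() {
	name := encodeName("cred-1", "2024-01-01-00-00-00Z")
	id, err := decodeCloudCredRefID(name)
	fmt.Println(id, err)

	_, err = decodeCloudCredRefID("manually-created-vm")
	fmt.Println(err != nil)
}
```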

Recommendation: Use Tags If Possible

Tags are the recommended and easiest integration path. They provide:

Clean separation of metadata from instance properties
Efficient server-side filtering via TagFilters
Full billing/usage tracking capabilities
Straightforward implementation

If your cloud API supports any form of instance tagging, labels, or metadata—use it.

Before implementing a custom solution, please reach out to the Brev team. We can help design an approach that works reliably with the control plane and avoid edge cases that could cause instance correlation issues.

11. Error Handling and Status Reporting

Error Categories

Your provider implementation should translate API errors into the standard error constants defined in v1/errors.go:

Category	Examples	Return This Error Constant
Out of Stock	No capacity in region	`v1.ErrInsufficientResources`
Quota Exceeded	Hit account limit	`v1.ErrOutOfQuota`
Resource Not Found	Instance/image doesn't exist	`v1.ErrResourceNotFound`, `v1.ErrInstanceNotFound`, `v1.ErrImageNotFound`
Service Unavailable	API temporarily down	`v1.ErrServiceUnavailable`
Auth Failed	Bad API key	Return HTTP 401/403 error
Internal Error	Your system issue	Return error with HTTP 500 details

Reference: See v1/errors.go for the full list of error constants:

var (
	ErrInsufficientResources = errors.New("zone has insufficient resources to fulfill the request, InsufficientCapacity")
	ErrOutOfQuota            = errors.New("out of quota in the region fulfill the request, InsufficientQuota")
	ErrImageNotFound         = errors.New("image not found")
	ErrDuplicateFirewallRule = errors.New("duplicate firewall rule")
	ErrInstanceNotFound      = errors.New("instance not found")
	ErrResourceNotFound      = errors.New("resource not found")
	ErrServiceUnavailable    = errors.New("api is temporarily unavailable")
)
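
A provider's translation layer might look roughly like the sketch below. The status codes and the `out_of_stock` / `quota_exceeded` error codes are invented for illustration; substitute whatever your API actually returns. The error variables are local stand-ins for the constants in v1/errors.go so the example stands alone.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Local stand-ins for the constants defined in v1/errors.go.
var (
	ErrInsufficientResources = errors.New("insufficient resources")
	ErrOutOfQuota            = errors.New("out of quota")
	ErrResourceNotFound      = errors.New("resource not found")
	ErrServiceUnavailable    = errors.New("api is temporarily unavailable")
)

// translateError maps a hypothetical provider response (HTTP status plus a
// provider-specific error code) to one of the standard errors. Anything
// unrecognized keeps its HTTP status in the message, which covers the
// auth-failure and internal-error rows of the table above.
func translateError(status int, code string) error {
	switch {
	case status >= 200 && status < 300:
		return nil
	case code == "out_of_stock":
		return ErrInsufficientResources
	case code == "quota_exceeded":
		return ErrOutOfQuota
	case status == http.StatusNotFound:
		return ErrResourceNotFound
	case status == http.StatusServiceUnavailable:
		return ErrServiceUnavailable
	default:
		return fmt.Errorf("provider API error: status %d, code %q", status, code)
	}
}

func main() {
	err := translateError(http.StatusConflict, "out_of_stock")
	// Callers compare with errors.Is rather than matching on message text.
	fmt.Println(errors.Is(err, ErrInsufficientResources))
}
```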

Out of Stock Handling

"Out of stock" is common with GPUs. Your implementation should return v1.ErrInsufficientResources:

Your API returns your specific "no capacity" error
Your provider translates this to v1.ErrInsufficientResources
Brev marks that type as temporarily unavailable in that region
The syncer will re-check availability on the next poll

Example from Shadeform provider (v1/providers/shadeform/instance.go):

if shadeformErrorResponse.ErrorCode == outOfStockErrorCode {
    return v1.ErrInsufficientResources
}

Example from Lambda Labs provider (v1/providers/lambdalabs/errors.go):

if strings.Contains(e.Error(), "Not enough capacity") || strings.Contains(e.Error(), "insufficient-capacity") {
    return v1.ErrInsufficientResources
}

12. Billing and Pricing

Billing

Billing arrangements are handled separately during the integration partnership setup. In most cases, this simply means Brev creates an account on your cloud platform with a credit card on file. There is no special billing integration or reconciliation process required.

How Pricing Works

Brev displays your prices via InstanceType.BasePrice (see v1/instancetype.go).

Field	Type	Notes
BasePrice	`*currency.Amount`	From `github.com/bojanz/currency`
Currency	Up to implementer	Most providers use `"USD"`

13. Common Questions

"Do you need access to our admin console?"

No. We only need programmatic API access. All operations go through your public API—see Section 6 for credential details.

"What images/OS should our VMs run?"

Requirement	Details
OS	Ubuntu 22.04 (preferred) or 24.04

Custom images work if they meet these requirements. The SDK validates image compatibility via ValidateInstanceImage().

"What if we don't have public IPs?"

Public IP with SSH access is required for standard integration. Bastion/jump host routing is supported (see InternalPortMappings in the Instance struct). Other alternatives (VPN, Cloudflare tunnels) require custom integration work.

"How do you track GPU interconnect (NVLink, SXM, PCIe)?"

We track interconnect type via the GPU.NetworkDetails field. Your implementation should populate this with values like "PCIe", "SXM", "SXM4", or "SXM5". If you have multiple variants (e.g., PCIe vs SXM versions of the same GPU), surface them as separate instance types.

"What SLA/uptime do you expect from our API?"

Requirement	Target
Availability	99%+ uptime
Response time	< 5 seconds typical
Idempotency	Supported where possible

The Instance Syncer is resilient to brief outages—it retries and recovers automatically.

"What does Brev do on the VMs after launch?"

After CreateInstance returns successfully:

SSH connection: Brev waits for SSH to become available (up to 10 minutes via ValidateInstanceSSHAccessible)
Key bootstrapping: Brev adds admin keys to authorized_keys via SSH
Agent setup: Brev installs a lightweight agent for tunnel management and environment configuration

You don't need to do anything special—just ensure the SSH public key from CreateInstanceAttrs.PublicKey is installed before the instance becomes accessible.

Next Steps

To begin integration:

Follow the Integration Guide and copy the template — Start with the Integration Guide, which walks through the v1 interfaces, directory layout, and a copy/paste scaffold. Use the Lambda Labs provider as your canonical reference.
Implement your Cloud provider — Build out instance lifecycle, instance types, capabilities, and security conformance under internal/{provider}/v1/. Embed NotImplCloudClient for any unsupported operations.
Run the local Validation Tests — Wire up validation_test.go using real credentials and run make test-validation locally. This exercises instance create/get/list/terminate, instance types, and capability checks against your live API.
Provide Brev with a test account — Give Brev access to run validation independently. This typically means a console account or provided API credentials, but exact requirements vary by provider.
Brev validates end-to-end flow — We run the full validation suite plus our internal end-to-end tests against your provider, confirm catalog readiness, and enable it in Brev.

See the Integration Guide for detailed implementation instructions, and reach out to the Brev team with any questions.

Document version: 2.0 For Brev integration partners
