Updating serverless documentation as part of rewrite #58

DavidatVast · 2026-01-26T22:32:51Z

Self explanatory - rewrote / re-ordered serverless documentation

https://vastai.atlassian.net/browse/AUTO-1104

robertfernandez-vast · 2026-02-01T19:07:03Z

Developers want to use serverless. Based off of this fact, the following should be assumed true:

For a developer to use serverless, they must know how to set-up their serverless environment.
For a developer to use serverless, they must know how to interact with their serverless environment.
For a developer to use serverless, they must know how to maintain their serverless environment.

All sections should serve and be organized based off of these statements.

Users can set-up their serverless environment via:

The vast website
The vast-sdk
The vast cli

We should have sections which serve each of these form factors.

Setting up their serverless environment includes:

Creating and destroying endpoints and what that does to their data and serverless environment
Creating and destroying workergroups and what that does to their data and serverless environment
Using our predefined templates or defining custom templates
- This includes instructing developers on how to make their own pyworkers for their own use cases.
Addressing frequently occurring questions/pain points with set-up
- Why does it take so long for workers to load?
- How do I know when my serverless environment is ready for use?
- How do I know what serverless environment is best suited for my use case?
Definitions for each worker state and how to understand how to use these states to get your serverless environment setup.

Users can interact with their serverless environment via:

The vast-sdk
The vast-cli

We should organize the sections so that it is clear to developers how they interact with serverless for each form factor.

Interacting includes:

How to send requests to your endpoint group once its set-up

Users can maintain their serverless environment via:

The vast website
The vast-sdk
The vast-cli

We should organize the sections so that it is clear how they maintain their serverless environment.

Maintaining includes:

Paying for their serverless environment
Keeping their pyworkers up-to-date
Best Practices for keeping their data safe (ex: backing up data)
Describing what our metrics mean on the endpoint page, and what assumptions go into each metric. (Units, limitations, definitions)
How to understand how your serverless environment is behaving based on worker states.

documentation/serverless/architecture.mdx

LucasArmandVast · 2026-02-02T17:56:45Z

documentation/serverless/architecture.mdx

+- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

-You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
+Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).


Might be useful to explain here that endpoints act as "routing targets", and that all requests sent to an endpoint are load-balanced across that endpoints workers.

LucasArmandVast · 2026-02-02T17:57:47Z

documentation/serverless/architecture.mdx

+### Workergroups

-The system architecture for an application using Vast.ai Serverless includes the following components:
+A **Workergroup** defines how workers are recruited and created. Workergroups are configured with [**workergroup-level parameters**](./workergroup-parameters) and are responsible for selecting which GPU offers are eligible for worker creation.


More than just GPU selection, workergroups decide what code is actually running on the endpoint via the template. It's probably important to mention this here.

documentation/serverless/architecture.mdx

LucasArmandVast · 2026-02-02T18:10:14Z

documentation/serverless/architecture.mdx

+
+### Workers
+
+**Workers** are individual GPU instances created and managed by the Serverless engine. Each worker runs a [**PyWorker**](./overview), a Python web server that loads the machine learning model and serves requests.


PyWorker does not load the machine learning model. Both the pyworker and the "machine learning model" (maybe a better, more generic name for this?) are launched by the Docker entrypoint or on_start.sh script. The pyworker runs in parallel with the machine learning model and listens to it (via it's log output). It reports the readiness of the model to the autoscaler, and acts as a proxy for incoming requests on the worker, keeping track of requests and passing them along to the model server.

Taking Lucas' comment, a more accurate sentence might be:
"Each worker runs a PyWorker, a Python web server that runs the machine learning model's on-start script, orchestrates hardware benchmarking, and serves requests."

But the PyWorker does not run the machine learning model's on-start script. Both the PyWorker and the machine learning model are started completely separately, both from the same on-start script. Two different processes, each started from the same script. They do not invoke one another. They communicate only by logs and passing HTTP requests along.

Updated to:
Workers are individual GPU instances created and managed by the Serverless engine. Each worker runs a PyWorker, a Python web server that monitors the inference server's readiness, proxies incoming requests, and coordinates with the autoscaler.

LucasArmandVast · 2026-02-02T18:15:18Z

documentation/serverless/architecture.mdx

+
+- Receiving and processing inference requests
+- Reporting performance metrics (load, utilization, benchmark results)
+- Participating in automated scaling and routing decisions


"- Participating in automated scaling and routing decisions"
This is vague and only sort-of true. They do "participate" in the sense that they report information on their health, current load, dropped requests, and benchmark speed, but outside of just reporting that information, they don't actually take action for routing or scaling. All decision making is handled by the autoscaler based on the reports of the pyworkers.

"Inform automated scaling and routing decisions" is more correct

documentation/serverless/architecture.mdx

LucasArmandVast · 2026-02-02T18:19:08Z

documentation/serverless/architecture.mdx

-5. The PyWorker sends the model results back to the client.
-6. Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which it uses to make scaling decisions.
+- Authentication
+- Routing requests to appropriate workers


This is handled by the autoscaler, not the SDK. The SDK asks the autoscaler to "route" it to a worker, and then the SDK passes the request to that worker once it's routed.

LucasArmandVast · 2026-02-02T18:25:49Z

documentation/serverless/automatedperformancetesting.mdx

+
+## How Benchmark Testing Works
+
+When a new Workergroup is created, the serverless engine enters a **learning phase**. During this phase, it recruits a variety of machine types from those specified in `search_params`. For each new worker, the engine runs the user-configured benchmark and evaluates performance.


The "learning phase" doesn't literally exist, but is a label to basically say "Your endpoint is inefficient right now because we haven't gone through enough scaling cycles to optimize it yet."

Also, the PyWorker, not the Serverless Engine, is responsible for running the benchmark, the results of which are then reported to the Serverless Engine.

documentation/serverless/automatedperformancetesting.mdx

LucasArmandVast · 2026-02-02T18:32:16Z

documentation/serverless/automatedperformancetesting.mdx

+
+For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository:
+
+https://github.com/vast-ai/vast-sdk/tree/main/examples/client


We can point to a load-specific example:
https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py

LucasArmandVast · 2026-02-02T18:36:58Z

documentation/serverless/creating-new-pyworkers.mdx

  - Output:
    - A `float` representing workload; larger means “more expensive.”
+  - Recommendation:
+    - For many applications that have a vast majority of similarly complex tasks, utilizing a single constant value per task is sufficient (example cost = 100)


This sentence could be worded better.
"For applications where requests do not vary much in complexity, returning a constant value (i.e. 100) is often sufficient."

LucasArmandVast · 2026-02-02T18:40:39Z

documentation/serverless/managing-scale.mdx

@@ -0,0 +1,37 @@
+---
+title: Managing Scale
+description: Some configuration change strategies to manage different load scenarios.


"configuration change strategies" doesn't make obvious sense to me, but I think I get what you're going for.

Maybe "Learn how to configure your Serverless endpoint for different load scenarios" ?

LucasArmandVast · 2026-02-02T18:43:12Z

documentation/serverless/managing-scale.mdx

+      {
+        "@type": "HowToStep",
+        "name": "Manage for Bursty Load",
+        "text": "Adjust min_workers to increase managed inactive workers and add capacity for peak demand. Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions. Check max_workers to ensure it's set high enough for the serverless engine to create the required number of workers."


" Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions"

I don't really understand this explanation of using cold_mult. Cold_mult just sets the cold worker target to hot_worker_target * cold_mult (roughly). It always increases the number of workers, and the rate of worker recruitment is identical, just with a higher target.

LucasArmandVast · 2026-02-02T18:44:12Z

documentation/serverless/managing-scale.mdx

+## Managing for Bursty Load
+
+- **Adjust** `min_workers`: This will change the number of managed inactive workers, and increase capacity for high peak
+- **Increase** `cold_mult`: This will change the rate (and in extreme cases, the number) of worker recruitment. Use this to manage for fast transitions in demand


Same for comment above.

LucasArmandVast · 2026-02-02T18:44:38Z

documentation/serverless/managing-scale.mdx

+
+## Managing for Low Demand or Idle Periods
+
+- **Adjust** `min_load`: Reducing `min_load` will reduce the minimum number of active workers. Set to `1` to reduce the number to its minimum value of 1 worker


We support scaling to zero as of 2/3/26 (assuming we deploy)

LucasArmandVast · 2026-02-02T18:47:51Z

documentation/serverless/overview.mdx

 In the diagram's example, a user's client is attempting to infer from a machine learning model. With Vast's Serverless setup, the client:

-1. Sends a [`/route/`](/documentation/serverless/route) POST request to the serverless system. This asks the system for a GPU instance to send the inference request.
+1. Sends a `/route/` POST request to the serverless system. This asks the system for a GPU instance to send the inference request.


For consistency, we want to change "serverless system" to "Serverless engine"

documentation/serverless/overview.mdx

LucasArmandVast · 2026-02-02T18:51:45Z

documentation/serverless/pricing.mdx


-Vast.ai Serverless offers pay-as-you-go pricing for all workloads at the same rates as Vast.ai's non-Serverless GPU instances. Each instance accrues cost on a per second basis.
-This guide explains how pricing works.
+Unlike other providers, Vast Serverless offers pay-per-second pricing for all workloads at the same as Vast.ai’s non-Serverless GPU instances.


"at the same as" -> "at the same price as"

LucasArmandVast · 2026-02-02T18:53:21Z

documentation/serverless/pricing.mdx

+| State    | Description       | GPU compute | Storage | Bandwidth (in/out) |
+|----------|-------------------|-------------|---------|--------------------|
+| Ready    | An active worker  | Billed      | Billed  | Billed             |
+| Loading  | Model is loading  | Billed      | Billed  | Billed             |


"Starting" is included in "Loading", so users will also have to pay full price while their instance is starting up from a cold state.

But users never see "Starting", right? They will see Model Loading, which we should add here

"Model Loading" actually didn't make it into the new UI (maybe we need to talk to Madison about that), but yes users will see starting. Cold workers go from inactive -> starting -> ready, whereas new workers go from created -> loading -> ready

@LucasArmandVast Could you provide all states that a user would see and put which states have GPU compute, Storage and Bandwith (in/out) billed. This would be helpful for David.

I'll sync up with Madison on it and make a documentation hotfix later

LucasArmandVast · 2026-02-02T18:59:21Z

documentation/serverless/SDKoverview.mdx

+
+## Why Use the SDK
+
+While there are other ways to interact with Vast Serverless—such as the CLI and the REST API—the SDK is the **most powerful and easiest** method to use. It is the recommended approach for most applications due to its higher-level abstractions, reliability, and ease of integration into Python-based workflows.


The SDK (in its current state) is not really a replacement for the CLI/UI, since they are still needed for endpoint/workergroup/template creation. Deployments project will change this.

It is true though that using the SDK is pre-packaged "correct" way to use the API, which still remains available for advanced users.

Done - replaced it with "...interact with a serverless endpoint..."

Colter-Downing · 2026-02-02T20:15:49Z

documentation/serverless/architecture.mdx

+<Frame caption="Serverless Architecture">
+![Serverless Architecture](/images/serverless-architecture.webp)
+</Frame>



This picture will need updating. The whole back-and-forth between the engine and the worker is now simplified through the SDK. Ie the SDK does this for you and you never really have to know about it. I guess it could still be important for the Anthony persona since he needs to know how all his requests are being handled.

Right now a conceptual diagram, can update it with SDK image in the next version.

Colter-Downing · 2026-02-02T20:28:00Z

documentation/serverless/workergroup-parameters.mdx

+
+The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.
+
+## `gpu_ram`


Goes to the previously mentioned point, can we remove this and include in search_params somehow?

This is a question for devs

Parsing search_params might be weird, but I don't see why we couldn't do it. There is a real difference between "how many GB VRAM do you need" vs. "how many GB is your model weights", but they are probably close enough.

There is this whole infra built for tracking "additional disk usage" since pyworker start, and in theory "download progress" should be additional_disk_usage / model_size. Then we could provide approximate loading progress bars for each worker. I think this was an intended but unimplemented feature. It would probably look like changing this gpu_ram parameter to something like (optional) model_size.

Colter-Downing · 2026-02-02T20:46:35Z

documentation/serverless/serverless-parameters.mdx

-The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load 
-is 900 tokens/second and target\_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 
-tokens/second (11%) as buffer for traffic spikes.
+## Minimum Inactive Workers (`min_workers`)


We should drop the 'Inactive'. This value is state agnostic, meaning 2 Ready and 3 Inactive still satisfies min_workers=5

Colter-Downing · 2026-02-02T20:53:26Z

documentation/serverless/workergroup-parameters.mdx

@@ -0,0 +1,69 @@
+---
+title: Worker Recruitment


This doc is actually a list of workergroup parameters. Maybe we rename the title to "Workergroup Parameters" in a similar fashion to the existing "Endpoint Parameters"?

robertfernandez-vast · 2026-02-01T17:48:55Z

documentation/serverless/architecture.mdx

 ---
-title: Architecture
-description: Understand the architecture of Vast.ai Serverless, including the Serverless System, GPU Instances, and User (Client Application). Learn how the system works, how to use the routing process, and how to create Worker Groups.
+title: Overview


We should rename the previous section to Serverless Feature Overview. This section should be named Architecture Overview. Having Serverless Overview followed Overview is a bit confusing.

I've renamed the "overview" to "architecture overview" and left "serverless overview" the way it is

robertfernandez-vast · 2026-02-01T17:57:25Z

documentation/serverless/architecture.mdx

+<Frame caption="Serverless Architecture">
+![Serverless Architecture](/images/serverless-architecture.webp)
+</Frame>


I think it would be clearer if each Pyworker <--> Model Inference Group was encapsulated in a box labeled GPU Instance.

I will redraw this diagram in the next revision of the documentation

robertfernandez-vast · 2026-02-01T18:12:18Z

documentation/serverless/architecture.mdx

+- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

-You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
+Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).


I prefer the term use case instead of function for this sentence. For a software dev, a function a specific term.

robertfernandez-vast · 2026-02-01T18:16:24Z

documentation/serverless/architecture.mdx

+An endpoint consists of:

-It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup.
+- A named endpoint identifier


Are you trying to capture endpoint name and endpoint id with this phrasing? I think The endpoint's name would be sufficient here.

The terminology is used in a separate place in the UI, I'm trying to keep consistent for now

robertfernandez-vast · 2026-02-01T18:18:24Z

documentation/serverless/architecture.mdx

+<Frame caption="Serverless Architecture">
+![Serverless Architecture](/images/serverless-architecture.webp)
+</Frame>


The diagram should show all primary components so a user knows where each important component fits into the this flow.

Will change diagram in v2

robertfernandez-vast · 2026-02-01T18:29:01Z

documentation/serverless/architecture.mdx

+While CLI and API access are available, the SDK is the recommended method for most applications.

-This 2-step routing process is used for security and flexibility. By having the client send payloads directly to the GPU instances, your payload information is never stored on Vast servers.
+## Example Workflow


This example workflow describes what happens after a dev client sets up an already has their serverless environment setup. This should be made clear so that a new dev client that is trying to set up serverless doesn't think this workflow is what they have to follow to setup serverless.

robertfernandez-vast · 2026-02-01T18:30:07Z

documentation/serverless/architecture.mdx

+2. The Serverless system routes the request and returns a suitable worker address based on current load and capacity.
+3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
+4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
+5. The inference result is returned to the client.


The inference result is returned to the client's application which is then forwarded to their users.

robertfernandez-vast · 2026-02-01T18:30:32Z

documentation/serverless/architecture.mdx

+3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
+4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
+5. The inference result is returned to the client.
+6. Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions.


I like this sentence.

robertfernandez-vast · 2026-02-01T19:19:33Z

documentation/serverless/workergroup-parameters.mdx

+A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates.
+
+There is no default value for `launch_args`.


This parameter seems nebulous to me. We should have an example or something that tells developers what kind of format they should expect to use for this parameter.

I'll add one as a v2 in the upcoming documentation update

robertfernandez-vast · 2026-02-01T19:30:50Z

documentation/serverless/serverless-parameters.mdx

+- `target_util = 0.9` → 11.1% spare capacity
+- `target_util = 0.8` → 25% spare capacity
+- `target_util = 0.5` → 100% spare capacity
+- `target_util = 0.4` → 150% spare capacity


We should give a general equation so user's aren't constrained to only these examples.

The example above provides an equation to follow, don't think we need to do more here

robertfernandez-vast · 2026-02-02T23:16:48Z

documentation/serverless/pricing.mdx

+| State    | Description       | GPU compute | Storage | Bandwidth (in/out) |
+|----------|-------------------|-------------|---------|--------------------|
+| Ready    | An active worker  | Billed      | Billed  | Billed             |
+| Loading  | Model is loading  | Billed      | Billed  | Billed             |


@LucasArmandVast Could you provide all states that a user would see and put which states have GPU compute, Storage and Bandwith (in/out) billed. This would be helpful for David.

robertfernandez-vast · 2026-02-02T23:18:19Z

documentation/serverless/pricing.mdx

-| Inactive | Not billed  | Billed  | Billed       | Billed        |
-
-GPU compute refers to the per-second GPU rental charges. See the [Billing Help](/documentation/reference/billing#ugwiY) page for rate details.
+| State    | Description                                                                                 | GPU compute |


Change the header label from GPU compute to Billing Description

robertfernandez-vast · 2026-02-02T23:21:15Z

documentation/serverless/pricing.mdx

+|-----------|--------------------------------------------------------------------------------------------|-------------|
+| Active    | - Engine is actively managing worker recruitment and release <br /> - Workers are active   | All workers billed at their relevant states      |
+| Suspended | - Engine is NOT managing worker recruitment and release <br /> - Workers are active.       | Workers are billed based on their state at time of suspension. <br /> Any workers that are currently being created or are loading, will complete to a ready state (and be billed as such).      |
+| Stopped   | - Engine is NOT managing worker recruitment and release <br /> - Workers are all inactive  | All workers are changed to and billed in inactive state  |


All workers are changed to and are billed in the inactive state.

robertfernandez-vast · 2026-02-02T23:49:22Z

documentation/serverless/serverless-parameters.mdx

+## Minimum Load (`min_load`)

-If not specified during endpoint creation, the default value is 3.
+Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions.


I need some clarification about this from @LucasArmandVast and @Colter-Downing. Measuring load as perf per second doesn't feel accurate even though that's what is in the code base right now.

robertfernandez-vast · 2026-02-02T23:50:24Z

documentation/serverless/serverless-parameters.mdx

+### Best practice for setting `min_load`

-If not specified during endpoint creation, the default value is 5.
+- Start with `min_load = 1` (the default), which guarantees at least one active worker


We should mention that if a developer wants zero scaling (scaling to 0 hot workers), the min_load should be 0.

robertfernandez-vast · 2026-02-03T00:05:48Z

documentation/serverless/serverless-parameters.mdx

+## Cold Multiplier (`cold_mult`)

-The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set. 
+While `min_workers` is fixed regardless of traffic patterns, `cold_mult` defines inactive capacity as a multiplier of the current active workload.


The serverless engine attempts to plan its scaling for both predicted short term loads (1-30s) and for predicted long term loads (1 hour and above).

cold_mult is a scalar multiplier that allows developers to tune and plan for expected longer term loads.

cold_mult = (target_perf x target_util)/(predicted_load)

Sat with lucas and the submitted definition is probably more understandable for the average consumer. If people still confused, will update

LucasArmandVast · 2026-02-11T23:10:46Z

documentation/serverless/serverless-parameters.mdx

+## Minimum Load (`min_load`)

-If not specified during endpoint creation, the default value is 3.
+Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions.


Units are "load per second", name is "perf". "perf per second" is like the second derivative of load.

Each chunk of work (a request) has a "load"
Each worker can do some amount of "load per second" <- this is the "perf"

LucasArmandVast · 2026-02-12T01:18:09Z

documentation/serverless/serverless-parameters.mdx

+If not specified during endpoint creation, the default value is `3`.

-There is no default value for launch\_args.
+## Minimum Cold Load (`min_cold_load`)


Maybe somewhere info that says "if min_workers, cold_mult, or target_util conflicts, the highest of the three is used"

LucasArmandVast · 2026-02-12T01:19:02Z

documentation/serverless/serverless-parameters.mdx


+## Target Utilization (`target_util`)

+Target Utilization defines the ratio of active capacity to anticipated load and determines how much spare capacity (headroom) is reserved to handle short-term traffic spikes.


Might be good to clarify that this is yet another way to set the number of inactive workers in relation to your active load.

LucasArmandVast · 2026-02-12T01:22:49Z

LGTM. Left a couple more comments, but then we are good to go.

…AUTO-1104-docs-update

Updating serverless documentation as part of rewrite

8f745f3

DavidatVast requested review from Colter-Downing, LucasArmandVast and robertfernandez-vast January 26, 2026 22:32

mintlify bot deployed to staging January 26, 2026 22:33 View deployment

LucasArmandVast reviewed Feb 2, 2026

View reviewed changes

documentation/serverless/architecture.mdx Show resolved Hide resolved