
Conversation

@DavidatVast
Contributor

Self explanatory - rewrote / re-ordered serverless documentation

https://vastai.atlassian.net/browse/AUTO-1104

@robertfernandez-vast
Contributor

Developers want to use serverless. Based on this fact, the following should be assumed true:

  1. For a developer to use serverless, they must know how to set up their serverless environment.
  2. For a developer to use serverless, they must know how to interact with their serverless environment.
  3. For a developer to use serverless, they must know how to maintain their serverless environment.

All sections should serve and be organized around these statements.

Users can set up their serverless environment via:

  • The vast website
  • The vast-sdk
  • The vast cli

We should have sections which serve each of these form factors.

Setting up their serverless environment includes:

  • Creating and destroying endpoints and what that does to their data and serverless environment
  • Creating and destroying workergroups and what that does to their data and serverless environment
  • Using our predefined templates or defining custom templates
    • This includes instructing developers on how to make their own pyworkers for their own use cases.
  • Addressing frequently occurring questions/pain points with set-up
    • Why does it take so long for workers to load?
    • How do I know when my serverless environment is ready for use?
    • How do I know what serverless environment is best suited for my use case?
  • Definitions for each worker state, and how to use these states to get your serverless environment set up.

Users can interact with their serverless environment via:

  • The vast-sdk
  • The vast-cli

We should organize the sections so that it is clear to developers how they interact with serverless for each form factor.

Interacting includes:

  • How to send requests to your endpoint group once it's set up

Users can maintain their serverless environment via:

  • The vast website
  • The vast-sdk
  • The vast-cli

We should organize the sections so that it is clear how they maintain their serverless environment.

Maintaining includes:

  • Paying for their serverless environment
  • Keeping their pyworkers up-to-date
  • Best Practices for keeping their data safe (ex: backing up data)
  • Describing what our metrics mean on the endpoint page, and what assumptions go into each metric. (Units, limitations, definitions)
  • How to interpret your serverless environment's behavior based on worker states.

- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).
Contributor

Might be useful to explain here that endpoints act as "routing targets", and that all requests sent to an endpoint are load-balanced across that endpoints workers.

Contributor Author

Added

### Workergroups

The system architecture for an application using Vast.ai Serverless includes the following components:
A **Workergroup** defines how workers are recruited and created. Workergroups are configured with [**workergroup-level parameters**](./workergroup-parameters) and are responsible for selecting which GPU offers are eligible for worker creation.
Contributor

More than just GPU selection, workergroups decide what code is actually running on the endpoint via the template. It's probably important to mention this here.

Contributor Author

Done


### Workers

**Workers** are individual GPU instances created and managed by the Serverless engine. Each worker runs a [**PyWorker**](./overview), a Python web server that loads the machine learning model and serves requests.
Contributor

PyWorker does not load the machine learning model. Both the pyworker and the "machine learning model" (maybe a better, more generic name for this?) are launched by the Docker entrypoint or on_start.sh script. The pyworker runs in parallel with the machine learning model and listens to it (via its log output). It reports the readiness of the model to the autoscaler, and acts as a proxy for incoming requests on the worker, keeping track of requests and passing them along to the model server.

Contributor

Taking Lucas' comment, a more accurate sentence might be:
"Each worker runs a PyWorker, a Python web server that runs the machine learning model's on-start script, orchestrates hardware benchmarking, and serves requests."

Contributor

But the PyWorker does not run the machine learning model's on-start script. Both the PyWorker and the machine learning model are started completely separately, both from the same on-start script. Two different processes, each started from the same script. They do not invoke one another. They communicate only by logs and passing HTTP requests along.

Contributor Author

Updated to:
Workers are individual GPU instances created and managed by the Serverless engine. Each worker runs a PyWorker, a Python web server that monitors the inference server's readiness, proxies incoming requests, and coordinates with the autoscaler.


- Receiving and processing inference requests
- Reporting performance metrics (load, utilization, benchmark results)
- Participating in automated scaling and routing decisions
Contributor

"- Participating in automated scaling and routing decisions"
This is vague and only sort-of true. They do "participate" in the sense that they report information on their health, current load, dropped requests, and benchmark speed, but outside of just reporting that information, they don't actually take action for routing or scaling. All decision making is handled by the autoscaler based on the reports of the pyworkers.

"Inform automated scaling and routing decisions" is more correct

Contributor Author

Done

5. The PyWorker sends the model results back to the client.
6. Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which it uses to make scaling decisions.
- Authentication
- Routing requests to appropriate workers
Contributor

This is handled by the autoscaler, not the SDK. The SDK asks the autoscaler to "route" it to a worker, and then the SDK passes the request to that worker once it's routed.

Contributor Author

Done


## How Benchmark Testing Works

When a new Workergroup is created, the serverless engine enters a **learning phase**. During this phase, it recruits a variety of machine types from those specified in `search_params`. For each new worker, the engine runs the user-configured benchmark and evaluates performance.
Contributor

The "learning phase" doesn't literally exist, but is a label to basically say "Your endpoint is inefficient right now because we haven't gone through enough scaling cycles to optimize it yet."

Also, the PyWorker, not the Serverless Engine, is responsible for running the benchmark, the results of which are then reported to the Serverless Engine.

Contributor Author

Done


For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository:

https://github.com/vast-ai/vast-sdk/tree/main/examples/client
Contributor

Contributor Author

Done

- Output:
- A `float` representing workload; larger means “more expensive.”
- Recommendation:
- For many applications that have a vast majority of similarly complex tasks, utilizing a single constant value per task is sufficient (example cost = 100)
Contributor

This sentence could be worded better.
"For applications where requests do not vary much in complexity, returning a constant value (i.e. 100) is often sufficient."

Contributor Author

Done
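For illustration, here is a minimal sketch of the constant-cost approach discussed above. The function name and signature are hypothetical, not the actual PyWorker cost-hook interface; see the PyWorker examples in the vast-sdk repository for the real one.

```python
# Hypothetical sketch only: the real PyWorker cost hook and its signature may differ.
# It illustrates the "constant cost per request" recommendation above.
def workload_cost(request_payload: dict) -> float:
    # Treat every request as equally expensive.
    return 100.0
```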

@@ -0,0 +1,37 @@
---
title: Managing Scale
description: Some configuration change strategies to manage different load scenarios.
Contributor

"configuration change strategies" doesn't make obvious sense to me, but I think I get what you're going for.

Maybe "Learn how to configure your Serverless endpoint for different load scenarios" ?

Contributor Author

Done

{
"@type": "HowToStep",
"name": "Manage for Bursty Load",
"text": "Adjust min_workers to increase managed inactive workers and add capacity for peak demand. Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions. Check max_workers to ensure it's set high enough for the serverless engine to create the required number of workers."
Contributor

" Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions"

I don't really understand this explanation of using cold_mult. Cold_mult just sets the cold worker target to hot_worker_target * cold_mult (roughly). It always increases the number of workers, and the rate of worker recruitment is identical, just with a higher target.

Contributor Author

Done
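As a rough numeric illustration of the relationship described above (the cold worker target scaling with the hot worker target via `cold_mult`); all numbers are invented and the engine's actual calculation may differ:

```python
# Illustrative numbers only; the serverless engine's actual calculation may differ.
hot_worker_target = 4      # workers needed for current (hot) demand
cold_mult = 2.5            # endpoint parameter

cold_worker_target = hot_worker_target * cold_mult
print(cold_worker_target)  # 10.0 -> roughly ten cold/inactive workers planned
```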

## Managing for Bursty Load

- **Adjust** `min_workers`: This will change the number of managed inactive workers, and increase capacity for high peak
- **Increase** `cold_mult`: This will change the rate (and in extreme cases, the number) of worker recruitment. Use this to manage for fast transitions in demand
Contributor

Same for comment above.

Contributor Author

Done


## Managing for Low Demand or Idle Periods

- **Adjust** `min_load`: Reducing `min_load` will reduce the minimum number of active workers. Set to `1` to reduce the number to its minimum value of 1 worker
Contributor

We support scaling to zero as of 2/3/26 (assuming we deploy)

Contributor Author

Done

In the diagram's example, a user's client is attempting to infer from a machine learning model. With Vast's Serverless setup, the client:

1. Sends a [`/route/`](/documentation/serverless/route) POST request to the serverless system. This asks the system for a GPU instance to send the inference request.
1. Sends a `/route/` POST request to the serverless system. This asks the system for a GPU instance to send the inference request.
Contributor

For consistency, we want to change "serverless system" to "Serverless engine"

Contributor Author

Done


Vast.ai Serverless offers pay-as-you-go pricing for all workloads at the same rates as Vast.ai's non-Serverless GPU instances. Each instance accrues cost on a per second basis.
This guide explains how pricing works.
Unlike other providers, Vast Serverless offers pay-per-second pricing for all workloads at the same as Vast.ai’s non-Serverless GPU instances.
Contributor

"at the same as" -> "at the same price as"

Contributor Author

Done
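A quick per-second billing sanity check, using a made-up hourly rate (actual rates depend on the offer you rent):

```python
# Made-up rate for illustration; real rates vary per offer.
hourly_rate_usd = 0.50     # displayed per-hour price of the instance
seconds_running = 90       # time the worker accrued charges

cost_usd = hourly_rate_usd / 3600 * seconds_running
print(f"${cost_usd:.4f}")  # $0.0125
```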

| State | Description | GPU compute | Storage | Bandwidth (in/out) |
|----------|-------------------|-------------|---------|--------------------|
| Ready | An active worker | Billed | Billed | Billed |
| Loading | Model is loading | Billed | Billed | Billed |
Contributor
@LucasArmandVast Feb 2, 2026

"Starting" is included in "Loading", so users will also have to pay full price while their instance is starting up from a cold state.

Contributor

But users never see "Starting", right? They will see Model Loading, which we should add here

Contributor

"Model Loading" actually didn't make it into the new UI (maybe we need to talk to Madison about that), but yes users will see starting. Cold workers go from inactive -> starting -> ready, whereas new workers go from created -> loading -> ready

Contributor

@LucasArmandVast Could you provide all states that a user would see and put which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.

Contributor Author

I'll sync up with Madison on it and make a documentation hotfix later


## Why Use the SDK

While there are other ways to interact with Vast Serverless—such as the CLI and the REST API—the SDK is the **most powerful and easiest** method to use. It is the recommended approach for most applications due to its higher-level abstractions, reliability, and ease of integration into Python-based workflows.
Contributor

The SDK (in its current state) is not really a replacement for the CLI/UI, since they are still needed for endpoint/workergroup/template creation. Deployments project will change this.

It is true though that using the SDK is the pre-packaged "correct" way to use the API, which still remains available for advanced users.

Contributor Author

Done - replaced it with "...interact with a serverless endpoint..."

<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>

Contributor

This picture will need updating. The whole back-and-forth between the engine and the worker is now simplified through the SDK, i.e. the SDK does this for you and you never really have to know about it. I guess it could still be important for the Anthony persona since he needs to know how all his requests are being handled.


Contributor Author

Right now a conceptual diagram, can update it with SDK image in the next version.


The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.

## `gpu_ram`
Contributor

Goes to the previously mentioned point, can we remove this and include in search_params somehow?

Contributor

This is a question for devs

Contributor

Parsing search_params might be weird, but I don't see why we couldn't do it. There is a real difference between "how many GB VRAM do you need" vs. "how many GB is your model weights", but they are probably close enough.

There is this whole infra built for tracking "additional disk usage" since pyworker start, and in theory "download progress" should be additional_disk_usage / model_size. Then we could provide approximate loading progress bars for each worker. I think this was an intended but unimplemented feature. It would probably look like changing this gpu_ram parameter to something like (optional) model_size.

The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load
is 900 tokens/second and target\_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100
tokens/second (11%) as buffer for traffic spikes.
## Minimum Inactive Workers (`min_workers`)
Contributor

We should drop the 'Inactive'. This value is state agnostic, meaning 2 Ready and 3 Inactive still satisfies min_workers=5

Contributor Author

Done

@@ -0,0 +1,69 @@
---
title: Worker Recruitment
Contributor

This doc is actually a list of workergroup parameters. Maybe we rename the title to "Workergroup Parameters" in a similar fashion to the existing "Endpoint Parameters"?

Contributor Author

Done

---
title: Architecture
description: Understand the architecture of Vast.ai Serverless, including the Serverless System, GPU Instances, and User (Client Application). Learn how the system works, how to use the routing process, and how to create Worker Groups.
title: Overview
Contributor

We should rename the previous section to Serverless Feature Overview. This section should be named Architecture Overview. Having Serverless Overview followed by Overview is a bit confusing.

Contributor Author

I've renamed the "overview" to "architecture overview" and left "serverless overview" the way it is

Comment on lines +24 to +26
<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>
Contributor

I think it would be clearer if each Pyworker <--> Model Inference Group was encapsulated in a box labeled GPU Instance.

Contributor Author

I will redraw this diagram in the next revision of the documentation

- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).
Contributor

I prefer the term use case instead of function for this sentence. For a software dev, a function is a specific term.

Contributor Author

Done

An endpoint consists of:

It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup.
- A named endpoint identifier
Contributor

Are you trying to capture endpoint name and endpoint id with this phrasing? I think "the endpoint's name" would be sufficient here.

Contributor Author

The terminology is used in a separate place in the UI, I'm trying to keep consistent for now

Comment on lines +24 to +26
<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>
Contributor

The diagram should show all primary components so a user knows where each important component fits into this flow.

Contributor Author

Will change diagram in v2

While CLI and API access are available, the SDK is the recommended method for most applications.

This 2-step routing process is used for security and flexibility. By having the client send payloads directly to the GPU instances, your payload information is never stored on Vast servers.
## Example Workflow
Contributor

This example workflow describes what happens after a dev client already has their serverless environment set up. This should be made clear so that a new dev client who is trying to set up serverless doesn't think this workflow is what they have to follow to set up serverless.

Contributor Author

Done

2. The Serverless system routes the request and returns a suitable worker address based on current load and capacity.
3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
5. The inference result is returned to the client.
Contributor

The inference result is returned to the client's application, which then forwards it to their users.

Contributor Author

Done

3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
5. The inference result is returned to the client.
6. Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions.
Contributor

I like this sentence.
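For readers who want to see the two-step flow in code, here is a minimal sketch using the requests library. The base URL, field names, and worker-side path are placeholders rather than the confirmed schema; check the /route/ reference and the vast-sdk client examples for the real request shapes.

```python
import requests

API_KEY = "YOUR_VAST_API_KEY"

# Step 1: ask the Serverless engine for a worker.
# Placeholder URL and fields; see the /route/ documentation for the actual schema.
route = requests.post(
    "https://run.vast.ai/route/",
    json={"endpoint": "my-endpoint", "api_key": API_KEY, "cost": 100},
    timeout=30,
).json()

# Step 2: send the payload directly to the worker address returned by /route/,
# passing along whatever authentication data came back with it.
result = requests.post(
    f"{route['url']}/generate",   # worker-side path is a placeholder
    json={"auth_data": route.get("auth_data"), "payload": {"prompt": "Hello"}},
    timeout=120,
).json()
print(result)
```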

Comment on lines +43 to +45
A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates.

There is no default value for `launch_args`.
Contributor

This parameter seems nebulous to me. We should have an example or something that tells developers what kind of format they should expect to use for this parameter.

Contributor Author

I'll add one as a v2 in the upcoming documentation update

- `target_util = 0.9` → 11.1% spare capacity
- `target_util = 0.8` → 25% spare capacity
- `target_util = 0.5` → 100% spare capacity
- `target_util = 0.4` → 150% spare capacity
Contributor

We should give a general equation so users aren't constrained to only these examples.

Contributor Author

The example above provides an equation to follow, don't think we need to do more here
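For completeness, the general relationship behind those examples can be stated as a one-liner (spare capacity fraction = 1 / target_util - 1):

```python
def spare_capacity_fraction(target_util: float) -> float:
    # Headroom reserved above predicted load, as a fraction of that load.
    return 1.0 / target_util - 1.0

for u in (0.9, 0.8, 0.5, 0.4):
    print(u, f"{spare_capacity_fraction(u):.1%}")
# 0.9 -> 11.1%, 0.8 -> 25.0%, 0.5 -> 100.0%, 0.4 -> 150.0%
```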

| State | Description | GPU compute | Storage | Bandwidth (in/out) |
|----------|-------------------|-------------|---------|--------------------|
| Ready | An active worker | Billed | Billed | Billed |
| Loading | Model is loading | Billed | Billed | Billed |
Contributor

@LucasArmandVast Could you provide all states that a user would see and put which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.

| Inactive | Not billed | Billed | Billed | Billed |

GPU compute refers to the per-second GPU rental charges. See the [Billing Help](/documentation/reference/billing#ugwiY) page for rate details.
| State | Description | GPU compute |
Contributor

Change the header label from GPU compute to Billing Description

Contributor Author

Done

|-----------|--------------------------------------------------------------------------------------------|-------------|
| Active | - Engine is actively managing worker recruitment and release <br /> - Workers are active | All workers billed at their relevant states |
| Suspended | - Engine is NOT managing worker recruitment and release <br /> - Workers are active. | Workers are billed based on their state at time of suspension. <br /> Any workers that are currently being created or are loading, will complete to a ready state (and be billed as such). |
| Stopped | - Engine is NOT managing worker recruitment and release <br /> - Workers are all inactive | All workers are changed to and billed in inactive state |
Contributor

All workers are changed to and are billed in the inactive state.

Contributor Author

Done

## Minimum Load (`min_load`)

If not specified during endpoint creation, the default value is 3.
Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions.
Contributor

I need some clarification about this from @LucasArmandVast and @Colter-Downing. Measuring load as perf per second doesn't feel accurate even though that's what is in the code base right now.

### Best practice for setting `min_load`

If not specified during endpoint creation, the default value is 5.
- Start with `min_load = 1` (the default), which guarantees at least one active worker
Contributor

We should mention that if a developer wants zero scaling (scaling to 0 hot workers), the min_load should be 0.

Contributor Author

Done

## Cold Multiplier (`cold_mult`)

The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.
While `min_workers` is fixed regardless of traffic patterns, `cold_mult` defines inactive capacity as a multiplier of the current active workload.
Contributor

The serverless engine attempts to plan its scaling for both predicted short term loads (1-30s) and for predicted long term loads (1 hour and above).

cold_mult is a scalar multiplier that allows developers to tune and plan for expected longer term loads.

cold_mult = (target_perf x target_util)/(predicted_load)

Contributor Author

Sat with Lucas and the submitted definition is probably more understandable for the average consumer. If people are still confused, will update.

## Minimum Load (`min_load`)

If not specified during endpoint creation, the default value is 3.
Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions.
Contributor

Units are "load per second", name is "perf". "perf per second" is like the second derivative of load.

Each chunk of work (a request) has a "load"
Each worker can do some amount of "load per second" <- this is the "perf"

Contributor Author

Done
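A small numeric illustration of the load/perf distinction described above; all numbers are invented:

```python
# Invented numbers, purely to illustrate the units.
request_load = 100.0      # "load" assigned to one request by the cost function
worker_perf = 500.0       # "perf": load a single worker can process per second

requests_per_second_per_worker = worker_perf / request_load   # 5.0
incoming_load_per_second = 2000.0                             # current traffic
workers_needed = incoming_load_per_second / worker_perf       # 4.0
```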

If not specified during endpoint creation, the default value is `3`.

There is no default value for launch\_args.
## Minimum Cold Load (`min_cold_load`)
Contributor

Maybe add info somewhere that says "if min_workers, cold_mult, or target_util conflicts, the highest of the three is used"

Contributor Author

Done
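A sketch of the "highest of the three wins" behavior described in the comment above; the parameter values and the intermediate targets are purely illustrative, and the engine's actual reconciliation may differ:

```python
# Illustrative only; the engine's actual reconciliation may differ in detail.
min_workers = 5
target_from_cold_mult = 8        # e.g. hot worker target x cold_mult
target_from_target_util = 6      # e.g. predicted load / target_util, in workers

planned_workers = max(min_workers, target_from_cold_mult, target_from_target_util)
print(planned_workers)           # 8
```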


## Target Utilization (`target_util`)

Target Utilization defines the ratio of active capacity to anticipated load and determines how much spare capacity (headroom) is reserved to handle short-term traffic spikes.
Contributor

Might be good to clarify that this is yet another way to set the number of inactive workers in relation to your active load.

Contributor Author

Done

@LucasArmandVast
Contributor

LucasArmandVast commented Feb 12, 2026

LGTM. Left a couple more comments, but then we are good to go.

@DavidatVast DavidatVast merged commit 16ca314 into main Feb 12, 2026
1 check passed