Conversation
Pull request overview
This pull request implements a major architectural change to the auto-scaling system, splitting responsibilities between the gateway (scale-up) and workers (scale-down). The gateway now tracks active requests and pushes provision jobs when more capacity is needed, while workers independently manage instance lifecycle and self-regulate job consumption based on local capacity constraints.
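The gateway's scale-up decision described above reduces to comparing active requests against current capacity. A minimal sketch of that calculation, assuming each instance serves a fixed number of concurrent requests (the function and parameter names are illustrative, not the PR's actual API):

```go
package main

import "fmt"

// targetInstances returns how many instances a function needs to serve
// `active` in-flight requests, assuming each instance handles up to
// `perInstance` concurrent requests. Illustrative names only.
func targetInstances(active, perInstance int) int {
	if active == 0 {
		return 0
	}
	// Ceiling division: 5 active requests at 2 per instance -> 3 instances.
	return (active + perInstance - 1) / perInstance
}

func main() {
	fmt.Println(targetInstances(5, 2)) // 3
}
```

When the target exceeds the currently provisioned count, the gateway would push the difference as provision jobs to Redis; workers then pick them up subject to their own capacity gates.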
Changes:
- Gateway implements new scale-up logic that tracks active requests and pushes provision jobs to Redis
- Workers implement new scale-down "reaper" that manages two-tier idle timeout strategy (30s to 1 instance, 5m to zero)
- Workers add capacity gates to prevent overconsumption based on instance count and available RAM
- Documentation updated to reflect the new split architecture with diagrams and configuration tables
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| gateway/internal/scaler.go | New scaler that tracks active requests per function and pushes provision jobs when scale-up is needed |
| gateway/internal/router.go | Integration with scaler to track request start/completion |
| gateway/main.go | Initialization and startup of the scaler background loop |
| worker/internal/reaper.go | New reaper component implementing two-tier scale-down strategy |
| worker/internal/worker.go | Added capacity gate logic checking instance count and free RAM before consuming jobs |
| worker/internal/instance.go | Added Touch() and LastRequest() methods for tracking instance activity |
| worker/internal/proxy.go | Integration to update last request time on each proxied request |
| worker/internal/scaler.go | Removed old worker-side scaler implementation |
| worker/internal/config.go | Removed ScalingConfig struct (no longer needed) |
| shared/system/system.go | Added GetFreeRAM() utility for checking available memory |
| shared/protocol/messages.go | Formatting cleanup for struct field alignment |
| scripts/update-rootfs.sh | New script for updating root filesystem with aether-env binary |
| init/main.go | Minor comment cleanup |
| documentation/DESIGN.md | Updated architecture diagrams and scaling documentation |
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 13 comments.
```go
go func(fid, iid string) {
	if err := r.worker.StopInstance(fid, iid); err != nil {
		logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
	}
}(functionID, inst.ID)
```
Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.
The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.
```go
logger.Info("watching queue", "queue", protocol.QueueVMProvision)
```

```go
maxInstances := 10
```
Magic number: The maxInstances value is hardcoded to 10 (line 94), but this should be configurable or at least defined as a constant. The documentation (line 364) mentions "MaxInstances | 10 | Per-worker capacity limit", but having it hardcoded makes it difficult to adjust per deployment or environment.
Consider defining this as a configuration parameter or at minimum as a package-level constant with a clear name like DefaultMaxInstances or MaxInstancesPerWorker.
```go
	time.Sleep(500 * time.Millisecond)
	continue
}

freeRAM, err := system.GetFreeRAM()
if err == nil && freeRAM < 500 {
	logger.Info("low memory, waiting", "free_mb", freeRAM)
	time.Sleep(500 * time.Millisecond)
	continue
```
Inefficient polling: When at capacity or low memory, the worker sleeps for 500ms before checking again (lines 110, 117). With a 1-second BLPop timeout (line 121), this creates a cycle time of ~1.5 seconds when capacity-constrained. This polling approach is less efficient than event-driven alternatives.
Consider either: (1) using a longer sleep duration when at capacity to reduce CPU overhead, (2) implementing an exponential backoff, or (3) using a channel-based notification system when capacity frees up. However, this is acceptable for the current design and is not critical.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a major architectural change to the Function-as-a-Service (FaaS) platform, splitting auto-scaling responsibilities between the gateway (scale-up) and worker nodes (scale-down). The gateway now tracks active requests and pushes scale-up jobs, while workers independently manage scale-down and job consumption based on their own capacity. The documentation, core gateway, and worker logic have all been updated to reflect this new design. Additionally, a new script is added for updating the root filesystem, and system utility improvements are included.
Split Auto-Scaling Architecture
- Scaling design (documentation/DESIGN.md): The scaling documentation is rewritten to describe the new split architecture, detailing how the gateway handles scale-up and workers handle scale-down, including new diagrams and configuration tables. [1] [2]
Gateway: Scale-Up Logic
- Scaler (gateway/internal/scaler.go): Implements a background scaler that tracks active requests per function, calculates the target number of instances, and pushes provision jobs to Redis when more instances are needed.
- Integration (gateway/internal/router.go, gateway/main.go): The router now tracks requests for scaling, and the scaler is started as a background process in the gateway's main entrypoint. [1] [2] [3] [4]
Worker: Scale-Down and Capacity Management
- Reaper (worker/internal/reaper.go): Adds a reaper that periodically checks for idle instances and implements a two-tier scale-down strategy (to 1 hot instance after 30s, to zero after 5m idle).
- Idle tracking (worker/internal/instance.go, worker/internal/proxy.go): Each instance now updates its last request time on every proxied request, enabling accurate idle detection for scale-down. [1] [2]
System Utilities and Scripts
- New script that builds the aether-env binary and installs it along with an init script (scripts/update-rootfs.sh).
- New GetFreeRAM() utility that reads /proc/meminfo for worker capacity checks (shared/system/system.go).
Minor Cleanups
This refactor lays the foundation for more reliable, scalable, and decoupled function instance management, with clear separation of concerns between control plane and worker nodes.