Conversation
Pull request overview
This pull request implements a major architectural change to the auto-scaling system, splitting responsibilities between the gateway (scale-up) and workers (scale-down). The gateway now tracks active requests and pushes provision jobs when more capacity is needed, while workers independently manage instance lifecycle and self-regulate job consumption based on local capacity constraints.
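The gateway's scale-up decision described above reduces to comparing active requests against current capacity. A minimal sketch of that calculation, assuming each instance serves a fixed number of concurrent requests (the function and parameter names are illustrative, not the PR's actual API):

```go
package main

import "fmt"

// targetInstances returns how many instances a function needs to serve
// `active` in-flight requests, assuming each instance handles up to
// `perInstance` concurrent requests. Illustrative names only.
func targetInstances(active, perInstance int) int {
	if active == 0 {
		return 0
	}
	// Ceiling division: 5 active requests at 2 per instance -> 3 instances.
	return (active + perInstance - 1) / perInstance
}

func main() {
	fmt.Println(targetInstances(5, 2)) // 3
}
```

When the target exceeds the currently provisioned count, the gateway would push the difference as provision jobs to Redis; workers then pick them up subject to their own capacity gates.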
Changes:
- Gateway implements new scale-up logic that tracks active requests and pushes provision jobs to Redis
- Workers implement new scale-down "reaper" that manages two-tier idle timeout strategy (30s to 1 instance, 5m to zero)
- Workers add capacity gates to prevent overconsumption based on instance count and available RAM
- Documentation updated to reflect the new split architecture with diagrams and configuration tables
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| gateway/internal/scaler.go | New scaler that tracks active requests per function and pushes provision jobs when scale-up is needed |
| gateway/internal/router.go | Integration with scaler to track request start/completion |
| gateway/main.go | Initialization and startup of the scaler background loop |
| worker/internal/reaper.go | New reaper component implementing two-tier scale-down strategy |
| worker/internal/worker.go | Added capacity gate logic checking instance count and free RAM before consuming jobs |
| worker/internal/instance.go | Added Touch() and LastRequest() methods for tracking instance activity |
| worker/internal/proxy.go | Integration to update last request time on each proxied request |
| worker/internal/scaler.go | Removed old worker-side scaler implementation |
| worker/internal/config.go | Removed ScalingConfig struct (no longer needed) |
| shared/system/system.go | Added GetFreeRAM() utility for checking available memory |
| shared/protocol/messages.go | Formatting cleanup for struct field alignment |
| scripts/update-rootfs.sh | New script for updating root filesystem with aether-env binary |
| init/main.go | Minor comment cleanup |
| documentation/DESIGN.md | Updated architecture diagrams and scaling documentation |
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 13 comments.
```go
go func(fid, iid string) {
	if err := r.worker.StopInstance(fid, iid); err != nil {
		logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
	}
}(functionID, inst.ID)
```
Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.
The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.
```go
logger.Info("watching queue", "queue", protocol.QueueVMProvision)
```

```go
maxInstances := 10
```
Magic number: The maxInstances value is hardcoded to 10 (line 94), but this should be configurable or at least defined as a constant. The documentation (line 364) mentions "MaxInstances | 10 | Per-worker capacity limit", but having it hardcoded makes it difficult to adjust per deployment or environment.
Consider defining this as a configuration parameter or at minimum as a package-level constant with a clear name like DefaultMaxInstances or MaxInstancesPerWorker.
```go
	time.Sleep(500 * time.Millisecond)
	continue
}

freeRAM, err := system.GetFreeRAM()
if err == nil && freeRAM < 500 {
	logger.Info("low memory, waiting", "free_mb", freeRAM)
	time.Sleep(500 * time.Millisecond)
	continue
```
Inefficient polling: When at capacity or low memory, the worker sleeps for 500ms before checking again (lines 110, 117). With a 1-second BLPop timeout (line 121), this creates a cycle time of ~1.5 seconds when capacity-constrained. This polling approach is less efficient than event-driven alternatives.
Consider either: (1) using a longer sleep duration when at capacity to reduce CPU overhead, (2) implementing an exponential backoff, or (3) using a channel-based notification system when capacity frees up. However, this is acceptable for the current design and is not critical.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a major architectural change to the Function-as-a-Service (FaaS) platform, splitting auto-scaling responsibilities between the gateway (scale-up) and worker nodes (scale-down). The gateway now tracks active requests and pushes scale-up jobs, while workers independently manage scale-down and job consumption based on their own capacity. The documentation, core gateway, and worker logic have all been updated to reflect this new design. Additionally, a new script is added for updating the root filesystem, and system utility improvements are included.
Split Auto-Scaling Architecture
- Scaling design (documentation/DESIGN.md): The scaling documentation is rewritten to describe the new split architecture, detailing how the gateway handles scale-up and workers handle scale-down, including new diagrams and configuration tables. [1] [2]
Gateway: Scale-Up Logic
- Scaler (gateway/internal/scaler.go): Implements a background scaler that tracks active requests per function, calculates the target number of instances, and pushes provision jobs to Redis when more instances are needed.
- Integration (gateway/internal/router.go, gateway/main.go): The router now tracks requests for scaling, and the scaler is started as a background process in the gateway's main entrypoint. [1] [2] [3] [4]
Worker: Scale-Down and Capacity Management
- Reaper (worker/internal/reaper.go): Adds a reaper that periodically checks for idle instances and implements a two-tier scale-down strategy (to 1 hot instance after 30s, to zero after 5m idle).
- Idle tracking (worker/internal/instance.go, worker/internal/proxy.go): Each instance now updates its last request time on every proxied request, enabling accurate idle detection for scale-down. [1] [2]
System Utilities and Scripts
- New script that builds the aether-env binary and installs it along with an init script (scripts/update-rootfs.sh).
- New GetFreeRAM() utility that reads /proc/meminfo for worker capacity checks (shared/system/system.go).
Minor Cleanups
This refactor lays the foundation for more reliable, scalable, and decoupled function instance management, with clear separation of concerns between control plane and worker nodes.