## Problem
Each node currently owns an isolated /24 subnet, with its IP pool persisted locally to `pool.json`. This design is simple and avoids coordination, but it has operational drawbacks:
- No automatic IP reclamation — if a node dies, its IPs are stranded until manual intervention.
- Static pre-allocation — each node pre-allocates its full pool (e.g. 140 IPs) regardless of actual usage, wasting address space across the cluster.
- No global visibility — there is no single place to see IP allocation across all nodes; each node only knows its own state.
- No VM IP migration — when a VM moves between nodes, it cannot retain its IP.
## Proposal
Add an optional centralized store (etcd or Redis) to manage IP pool state across nodes.
## What changes

| Capability | Today | With centralized store |
| --- | --- | --- |
| IP reclamation on node failure | Manual | Automatic (detect dead node, release IPs, reassign ENIs via cloud API) |
| IP utilization | Per-node pre-allocation, potentially wasted | On-demand allocation from global pool |
| Cluster-wide visibility | Query each node individually | Single source of truth |
| Subnet assignment | Manual `--subnet` flag per node | Auto-allocate a /24 from a parent CIDR on node registration |
| VM IP migration | Not supported | Possible (with cloud API calls to reassign secondary IPs) |
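The automatic reclamation path hinges on deciding when a node is dead from its heartbeats. A minimal sketch of the staleness check (the TTL value and function names are illustrative, not part of the design):

```python
# A node is considered dead when its last heartbeat is older than the TTL.
HEARTBEAT_TTL = 30.0  # seconds; illustrative value

def dead_nodes(nodes: dict[str, float], now: float, ttl: float = HEARTBEAT_TTL) -> list[str]:
    """Return node names whose lastSeen timestamp has expired."""
    return [name for name, last_seen in nodes.items() if now - last_seen > ttl]

# Example: node-b stopped heartbeating 60s ago, node-a is fresh.
nodes = {"node-a": 1000.0, "node-b": 945.0}
print(dead_nodes(nodes, now=1005.0))  # -> ['node-b']
```

The watcher would then release `node-b`'s pool entries and trigger ENI reassignment through the cloud API; the check itself is deliberately trivial so it can run in a leader-elected controller.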
## What does NOT change
Per-node subnets remain mandatory on Volcengine. The VPC fabric's ARP proxy black-holes inbound cross-host traffic when secondary IPs share a subnet (documented in `docs/volcengine.md`). Separate /24 subnets per node force L3 routing, which is the only working path. A centralized store cannot work around this VPC-level constraint.
GKE alias IP ranges are already slices of a shared secondary range and are CIDR-based, so per-node CIDR blocks remain natural there as well.
## Suggested key layout

```
/cocoon-net/config         -> {parentCIDR, platform, ...}
/cocoon-net/subnets/{node} -> "172.20.100.0/24"
/cocoon-net/pool/{ip}      -> {node, mac, expiry, vmID}
/cocoon-net/nodes/{node}   -> {eniIDs, status, lastSeen}
```
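As a sketch, a node could serialize pool entries against this layout as plain JSON values (the helper name and example values are made up; the key prefix and field names follow the layout above):

```python
import json

PREFIX = "/cocoon-net"

def pool_entry(ip: str, node: str, mac: str, expiry: int, vm_id: str) -> tuple[str, str]:
    """Build the key/value pair for one allocated IP under /cocoon-net/pool/."""
    key = f"{PREFIX}/pool/{ip}"
    value = json.dumps({"node": node, "mac": mac, "expiry": expiry, "vmID": vm_id})
    return key, value

key, value = pool_entry("172.20.100.5", "node-a", "52:54:00:aa:bb:cc", 1735689600, "vm-42")
print(key)   # -> /cocoon-net/pool/172.20.100.5
```

Keying the pool by IP (rather than by node) keeps cluster-wide lookups and reclamation scans a single prefix query.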
## Rough implementation plan
- Node registration — on `cocoon-net init` or daemon startup, register with the store. If no `--subnet` is given, auto-allocate the next available /24 from the parent CIDR.
- Lease sync — the DHCP server writes lease state both to the local file (crash recovery) and to the store (global visibility). The local file remains the fast path; store writes are async.
- Health checking — nodes heartbeat to the store. A watcher (could be a separate controller or leader-elected daemon) detects dead nodes and triggers IP/ENI reclamation via cloud APIs.
- Backward compatibility — the store is opt-in (`--store etcd://...`). Without it, behavior is identical to today (local `pool.json` only).
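The auto-allocation step above can be sketched with Python's `ipaddress` module (the function name and CIDR values are illustrative; a real implementation would claim the subnet with a transactional store write, e.g. an etcd compare-and-swap, to avoid two nodes racing for the same /24):

```python
import ipaddress

def next_free_subnet(parent_cidr: str, taken: set[str]) -> str:
    """Return the first /24 inside parent_cidr that no registered node owns."""
    parent = ipaddress.ip_network(parent_cidr)
    for candidate in parent.subnets(new_prefix=24):
        if str(candidate) not in taken:
            return str(candidate)
    raise RuntimeError(f"parent CIDR {parent_cidr} is exhausted")

# node-a and node-b already registered; the next node gets 172.20.2.0/24.
taken = {"172.20.0.0/24", "172.20.1.0/24"}
print(next_free_subnet("172.20.0.0/16", taken))  # -> 172.20.2.0/24
```

A /16 parent yields 256 node subnets, which also bounds cluster size and is worth stating explicitly in `/cocoon-net/config`.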
## Open questions
- etcd vs Redis vs something else? etcd is natural if the cluster already runs Kubernetes. Redis is simpler for standalone deployments.
- Should the store be authoritative (DHCP server reads from it) or advisory (local state is authoritative, store is a mirror)?
- Cloud API rate limits for IP reassignment — how fast can we actually reclaim and reassign on Volcengine?
## Non-goals
- Eliminating per-node subnets (VPC routing constraint, not a software limitation)
- Real-time IP migration (cloud API latency makes this seconds-scale, not milliseconds)