Centralized IP pool management with etcd/redis #2

@CMGS

Description

Problem

Each node currently owns an isolated /24 subnet with its IP pool persisted locally to pool.json. This design is simple and avoids coordination, but has operational drawbacks:

  • No automatic IP reclamation — if a node dies, its IPs are stranded until manual intervention.
  • Static pre-allocation — each node pre-allocates its full pool (e.g. 140 IPs) regardless of actual usage, wasting address space across the cluster.
  • No global visibility — there is no single place to see IP allocation across all nodes; each node only knows its own state.
  • No VM IP migration — when a VM moves between nodes, it cannot retain its IP.

Proposal

Add an optional centralized store (etcd or Redis) to manage IP pool state across nodes.

What changes

| Capability | Today | With centralized store |
|---|---|---|
| IP reclamation on node failure | Manual | Automatic (detect dead node, release IPs, reassign ENIs via cloud API) |
| IP utilization | Per-node pre-allocation, potentially wasted | On-demand allocation from a global pool |
| Cluster-wide visibility | Query each node individually | Single source of truth |
| Subnet assignment | Manual `--subnet` flag per node | Auto-allocate a /24 from a parent CIDR on node registration |
| VM IP migration | Not supported | Possible (with cloud API calls to reassign secondary IPs) |

What does NOT change

Per-node subnets remain mandatory on Volcengine. The VPC fabric's ARP proxy black-holes inbound cross-host traffic when secondary IPs share a subnet (documented in docs/volcengine.md). Separate /24 subnets per node force L3 routing, which is the only working path. A centralized store cannot work around this VPC-level constraint.

GKE alias IP ranges are already slices of a shared secondary range and are CIDR-based, so per-node CIDR blocks remain natural there as well.

Suggested key layout

```
/cocoon-net/config                          -> {parentCIDR, platform, ...}
/cocoon-net/subnets/{node}                  -> "172.20.100.0/24"
/cocoon-net/pool/{ip}                       -> {node, mac, expiry, vmID}
/cocoon-net/nodes/{node}                    -> {eniIDs, status, lastSeen}
```

Rough implementation plan

  1. Node registration — on cocoon-net init or daemon startup, register with the store. If no --subnet is given, auto-allocate the next available /24 from the parent CIDR.
  2. Lease sync — the DHCP server writes lease state to both the local file (crash recovery) and the store (global visibility). The local file remains the fast path; store writes are async.
  3. Health checking — nodes heartbeat to the store. A watcher (could be a separate controller or leader-elected daemon) detects dead nodes and triggers IP/ENI reclamation via cloud APIs.
  4. Backward compatibility — the store is opt-in (--store etcd://...). Without it, behavior is identical to today (local pool.json only).

Open questions

  • etcd vs Redis vs something else? etcd is natural if the cluster already runs Kubernetes. Redis is simpler for standalone deployments.
  • Should the store be authoritative (DHCP server reads from it) or advisory (local state is authoritative, store is a mirror)?
  • Cloud API rate limits for IP reassignment — how fast can we actually reclaim and reassign on Volcengine?

Non-goals

  • Eliminating per-node subnets (VPC routing constraint, not a software limitation)
  • Real-time IP migration (cloud API latency makes this seconds-scale, not milliseconds)
