Centralized IP pool management with etcd/redis #2

@CMGS

Description

Problem

Each node currently owns an isolated /24 subnet with its IP pool persisted locally to pool.json. This design is simple and avoids coordination, but has operational drawbacks:

  • No automatic IP reclamation — if a node dies, its IPs are stranded until manual intervention.
  • Static pre-allocation — each node pre-allocates its full pool (e.g. 140 IPs) regardless of actual usage, wasting address space across the cluster.
  • No global visibility — there is no single place to see IP allocation across all nodes; each node only knows its own state.
  • No VM IP migration — when a VM moves between nodes, it cannot retain its IP.

Proposal

Add an optional centralized store (etcd or Redis) to manage IP pool state across nodes.

What changes

| Capability | Today | With centralized store |
|---|---|---|
| IP reclamation on node failure | Manual | Automatic (detect dead node, release IPs, reassign ENIs via cloud API) |
| IP utilization | Per-node pre-allocation, potentially wasted | On-demand allocation from a global pool |
| Cluster-wide visibility | Query each node individually | Single source of truth |
| Subnet assignment | Manual `--subnet` flag per node | Auto-allocate a /24 from a parent CIDR on node registration |
| VM IP migration | Not supported | Possible (with cloud API calls to reassign secondary IPs) |

What does NOT change

Per-node subnets remain mandatory on Volcengine. The VPC fabric's ARP proxy black-holes inbound cross-host traffic when secondary IPs share a subnet (documented in docs/volcengine.md). Separate /24 subnets per node force L3 routing, which is the only working path. A centralized store cannot work around this VPC-level constraint.

GKE alias IP ranges are already slices of a shared secondary range and are CIDR-based, so per-node CIDR blocks remain natural there as well.

Suggested key layout

```
/cocoon-net/config                          -> {parentCIDR, platform, ...}
/cocoon-net/subnets/{node}                  -> "172.20.100.0/24"
/cocoon-net/pool/{ip}                       -> {node, mac, expiry, vmID}
/cocoon-net/nodes/{node}                    -> {eniIDs, status, lastSeen}
```

Rough implementation plan

  1. Node registration — on cocoon-net init or daemon startup, register with the store. If no --subnet is given, auto-allocate the next available /24 from the parent CIDR.
  2. Lease sync — the DHCP server writes lease state to both the local file (crash recovery) and the store (global visibility). The local file remains the fast path; store writes are async.
  3. Health checking — nodes heartbeat to the store. A watcher (could be a separate controller or leader-elected daemon) detects dead nodes and triggers IP/ENI reclamation via cloud APIs.
  4. Backward compatibility — the store is opt-in (--store etcd://...). Without it, behavior is identical to today (local pool.json only).

Open questions

  • etcd vs Redis vs something else? etcd is natural if the cluster already runs Kubernetes. Redis is simpler for standalone deployments.
  • Should the store be authoritative (DHCP server reads from it) or advisory (local state is authoritative, store is a mirror)?
  • Cloud API rate limits for IP reassignment — how fast can we actually reclaim and reassign on Volcengine?

Non-goals

  • Eliminating per-node subnets (VPC routing constraint, not a software limitation)
  • Real-time IP migration (cloud API latency makes this seconds-scale, not milliseconds)
