-
Notifications
You must be signed in to change notification settings - Fork 9
rfc17-user-bgp-status #3037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
rfc17-user-bgp-status #3037
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,149 @@ | ||
| # RFC 17 — User BGP Status | ||
|
|
||
| ## Summary | ||
|
|
||
| **Status: `Draft`** | ||
|
|
||
| This RFC closes the loop between Serviceability, DZD configuration, and the doublezerod | ||
| client connection by persisting the real BGP session state of each user onchain. | ||
| Today a user account reaches Activated status once resources are allocated and config | ||
| is pushed to the device, but there is no record of whether the user's doublezerod | ||
| actually established the BGP session. This gap makes it impossible to distinguish | ||
| users who are configured but never connected from users with healthy sessions. | ||
|
|
||
| ## Motivation | ||
|
|
||
| A user's lifecycle in DoubleZero crosses three layers: | ||
|
|
||
| 1. **Serviceability (onchain)** — tunnel, TunnelNet, DzIp are allocated and the user account reaches status = Activated. | ||
| 2. **DZD configuration** — controller renders BGP neighbor config and pushes it to Arista. | ||
| 3. **doublezerod connection** — the client establishes the GRE tunnel and BGP session. | ||
|
|
||
| Only after step 3 is the user truly connected. Today there is no observable onchain signal | ||
| for step 3. Persisting BGP status on the User account enables idle user detection and | ||
| connection diagnostics. | ||
|
|
||
| ## New Terminology | ||
|
|
||
| - **BGP Status** — `Unknown` (default), `Up` (session ESTABLISHED), `Down` (no active session). | ||
| - **Metrics Publisher** — keypair registered as `metrics_publisher_pk` on the Device account, | ||
| already used to authenticate telemetry writes to the telemetry program and state-ingest server. | ||
|
|
||
| ## Alternatives Considered | ||
|
|
||
| - **Store status in S3/ClickHouse only** — already done for raw socket stats. Not queryable | ||
| onchain and not accessible to other onchain programs. | ||
| - **Separate `UserBGPSession` PDA per user** — isolates BGP state from the User account and | ||
| avoids resizing it. Rejected because BGP status has a strict 1:1 relationship with the user, | ||
| the Device account must always be read to verify write authority regardless, and splitting the | ||
| data would require reading two accounts for every consumer that queries a user's connection | ||
| state. | ||
|
|
||
| ## Detailed Design | ||
|
|
||
| ### User account: new fields | ||
|
|
||
| Add three fields to the end of the `User` struct: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I kind of think this should live in a different account/structure, like
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
|
||
| 1. `bgp_status: BGPStatus` (1 byte, Borsh) — current session state: | ||
|
|
||
| Unknown = 0 (default, backwards-compatible) | ||
| Up = 1 | ||
| Down = 2 | ||
|
|
||
| 2. `last_bgp_up_at: u64` (8 bytes, DZ ledger slot) — the last slot when `bgp_status` | ||
| transitioned to `Up`. Zero means the session has never been observed Up. | ||
|
|
||
| 3. `last_bgp_reported_at: u64` (8 bytes, DZ ledger slot) — the last slot when the | ||
| telemetry agent successfully wrote a BGP status update for this user. Updated on | ||
| every `SetUserBGPStatus` write, whether or not `bgp_status` changed. Consumers can | ||
| use this field to detect agent silence: if `last_bgp_reported_at` is older than a | ||
| threshold, the `bgp_status` value should be treated as stale rather than | ||
| authoritative, avoiding false `Up` readings when the agent has stopped reporting. | ||
|
|
||
| The `SetUserBGPStatus` instruction reallocates the account by 17 bytes on first write | ||
| (1 + 8 + 8), with the metrics publisher covering any additional rent. `last_bgp_up_at` | ||
| is updated only when the status transitions to `Up`. | ||
|
|
||
| ### New instruction: SetUserBGPStatus (variant TBD) | ||
|
|
||
| Accounts: user (writable), device (readonly), metrics_publisher (signer + writable). | ||
|
|
||
| Validation: signer == device.metrics_publisher_pk, user.device_pk == device. | ||
|
|
||
| ### Telemetry collector | ||
|
|
||
| After each BGP socket collection tick in `collectBGPStateSnapshot`: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure we need to write this every
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think if we update everything, I agree it should be slower. If we only write deltas, then I think it could be event driven with throttling.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It should only write the state when the value has changed, or after a sufficiently long interval since the last write, to refresh the timestamp (slot). |
||
|
|
||
| 1. Fetch activated users for this device from the serviceability program. | ||
| 2. Map each user to its BGP peer IP: `overlay_dst_ip = user.TunnelNet[0:4]`, last octet +1. | ||
| 3. For each user: Up if a socket with matching RemoteIP exists, Down otherwise. | ||
| 4. For each user, submit `SetUserBGPStatus` if: (a) the computed status differs from | ||
| the last known onchain value, or (b) the last write was more than a configurable | ||
| interval ago (e.g., 1h), to keep `last_bgp_reported_at` fresh for staleness | ||
| detection. Submissions are enqueued into a non-blocking background worker that | ||
| retries independently so that a single RPC error does not delay other users or | ||
| block the collection tick. The metrics publisher keypair is already loaded in the | ||
| telemetry agent. | ||
|
|
||
| The raw TCP snapshot upload to S3 continues unchanged. | ||
|
|
||
| ## Impact | ||
|
|
||
| - Serviceability program: one new instruction, seventeen new bytes on User accounts (1 byte `bgp_status` + 8 bytes `last_bgp_up_at` + 8 bytes `last_bgp_reported_at`). | ||
| - Telemetry agent: one extra RPC call per collection tick to fetch users; up to N transactions | ||
| per tick (one per user whose status changed, or whose periodic refresh interval has elapsed). | ||
| - Read SDKs (Go, TypeScript, Python): update User deserialization for the new field. | ||
|
|
||
| ## Security Considerations | ||
|
|
||
| Authorization reuses the existing `metrics_publisher_pk` trust model, already enforced | ||
| by the telemetry program for latency sample writes. No new trust boundaries are introduced. | ||
|
|
||
| ## Backward Compatibility | ||
|
|
||
| Existing User accounts are 17 bytes shorter. The first SetUserBGPStatus write reallocates | ||
| the account. Until then, readers should treat the missing fields as Unknown and zero | ||
| respectively. SDK deserializers must handle the shorter account gracefully. | ||
|
|
||
| ## Rollout | ||
|
|
||
| Steps must be executed in order. Each step is a prerequisite for the next. | ||
|
|
||
| **Step 1 — Serviceability program upgrade** | ||
| Deploy the new program version with the `SetUserBGPStatus` instruction and the two new | ||
| fields on the `User` struct. No existing accounts are migrated; the new fields are | ||
| written lazily on first call. | ||
|
|
||
| **Step 2 — SDK updates** | ||
| Update read-only SDKs (Go, TypeScript, Python) to deserialize `bgp_status` and | ||
| `last_bgp_up_at`, defaulting both to zero when the account is shorter than the new | ||
| size. Must ship before step 3 to avoid deserialization errors in consumers that read | ||
| User accounts. | ||
|
|
||
| **Step 3 — Telemetry agent update** | ||
| Deploy the updated telemetry agent with the `SetUserBGPStatus` write client. From this | ||
| point on, each BGP collection tick writes status onchain. No rollback concern: if the | ||
| agent is reverted, fields simply stop updating and remain at their last known values. | ||
|
|
||
| **Step 4 — Verification** | ||
| On a devnet device with at least one activated user: | ||
| - Confirm `bgp_status` transitions to `Up` when the user's doublezerod is connected. | ||
| - Confirm `last_bgp_up_at` is set and matches the observed connection time. | ||
| - Disconnect the user and confirm `bgp_status` transitions to `Down` on the next tick. | ||
|
|
||
| ## Open Questions | ||
|
|
||
| - Should there be a grace period before marking a session `Down`? A single missed tick | ||
| due to a transient collection error would incorrectly transition an active user to | ||
| `Down`. One option is to require N consecutive `Down` observations before writing. | ||
| - Should we implement per-user rate limiting to prevent RPC saturation caused by | ||
| constant BGP flaps? A user cycling Up/Down rapidly would generate a transaction on | ||
| every state-change; a cooldown window or minimum time-between-writes per user account | ||
| could bound the worst-case submission rate. | ||
| - How should recurring circuit flaps be handled? A user whose BGP session repeatedly | ||
| drops and recovers within short windows may indicate an unstable circuit rather than | ||
| a transient error. Should the data model track a flap counter or a flap rate to | ||
| distinguish these users from those with a single clean Up or Down transition? And | ||
| should repeated flapping trigger any action, such as surfacing the user for manual | ||
| review? | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be RFC 18 I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a reference at the time of writing the RFC; it will be updated during implementation.