chore: use local database state for role management#2916
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
…-workos-fZao # Conflicts: # server/internal/access/impl.go # server/internal/access/listusergrants_test.go # server/internal/access/queries.sql # server/internal/access/repo/queries.sql.go # server/internal/attr/conventions.go # server/internal/auth/setup_test.go # server/internal/authz/engine_test.go # server/internal/organizations/setup_test.go # server/internal/xmcp/setup_test.go
🚀 Preview Environment (PR #2916)Preview URL: https://pr-2916.dev.getgram.ai
Gram Preview Bot |
…-workos-fZao # Conflicts: # server/internal/background/activities/process_workos_org_events.go
NOTE
I went through multiple rounds of reviews between me, claude and codex - so pretty confident about these changes.
Some things (Such as workOS outbox hardening) I'm deferring for later because otherwise this becomes an even large PR.
What
This PR finalizes the work around WorkOS sync by treating our database as source of truth. We are now using role URNs for principals (rather than slugs). Because this is a breaking change, we need to support the dual path for now (I'll work on a script to backfill so we can drop this multi support).
Key changes:
impl.goto a dedicated manager. This makes it reusable (probably not going to be needed) but most of all separates the concerns between normal endpoint handling and actual role management (impl was becoming quite hard to follow)Out of scope / follow-ups
role:<slug>backfill (AGE-1954): This PR introduces a dual-read/dual-delete for legacy slug-based principal URNs and canonicalrole:<kind>:<uuid>URNs. The backfill script that moves existingprincipal_grantsrows fromrole:<slug>to the canonical URN is not yet written. Until then every grant read hits two principal keys. Owner: . Target: .RolePrincipalslegacy fallback, drop the dual-delete inDeleteRoleGrants, and stop dual-reading inengine.PrepareContextandaccess.ListGrants.ListAccessMembers: Currently returns every member + a LATERAL role lookup per row in one query. Will not scale for enterprise orgs with thousands of members; pagination + total-count is a follow-up.Durability of WorkOS sync
Local writes are committed first; WorkOS writes happen in a detached goroutine (
context.WithoutCancel) with a bounded in-process retry (3 attempts, ≤300 ms backoff). This means:tx.Commitand the goroutine completing, WorkOS will diverge from local state with no automatic reconciliation.This is acceptable as an interim step because the local DB is now the authoritative read source for role membership and grants, but it is not a long-term design. Follow-up: move WorkOS writes to an outbox table written inside the same transaction, drained by a Temporal workflow with backoff in the minutes/hours range (Temporal is already wired up in
server/internal/background/activities/).Read-path performance
The previous implementation cached role + grant lookups in-process. This PR removes that cache in favor of direct DB reads. Per RBAC-enforced request the authz engine now executes:
users.GetUser(PK lookup)ListMemberRolePrincipalsByWorkosUser(assignment + LEFT JOIN org_roles + LEFT JOIN global_roles)GetPrincipalGrants(indexed lookup on(organization_id, principal_urn))Indexes used:
organization_role_assignments_org_workos_user_role_key(partial unique) covers (2)principal_grants_org_principal_scope_selector_keycovers (3)No load test has been run against this change. If p99 latency or PG QPS regresses, the planned mitigations are (in order of preference): (a) collapse (2) + (3) into a single query that joins assignments to grants by
workos_user_idand returns grants directly, (b) short-TTL LRU keyed by(org, userID)invalidated on role mutation.Bootstrap dependency
EnableRBACcallsSeedSystemRoleGrants, which resolves "admin"/"member" viaGetActiveOrganizationRoleBySlug. These slugs live inglobal_roles, which is populated by WorkOS webhook events. An org cannot have RBAC enabled until the global roles have been synced from WorkOS. Today this is implicit and there's no preflight check; documenting it here so it's not surprising the next time we bring a new environment online. These roles exist in production today.