Skip to content

refactor(/agents): unify caller scope across cli/web/lark/telegram; remove fall-through filter (cross-user leak) #466

@eanzhao

Description

@eanzhao

Bug surfaced via Lark /agents, but the architectural fix generalizes the same shape ChannelUserConfigScope introduced in #436 / fixed /daily in #437.

Update (post-review tightening, 2026-04-27): original draft was right direction but not the cleanest endpoint. Five tightenings folded in below — (1) caller-scope the version reader too, (2) OwnerScope as proto sub-message not C# record, (3) canonical Platform value not nullable, (4) AgentBuilderTool decomposed into application service + adapters, not file-moved, (5) IUserAgentCatalogRuntimeQueryPort split into caller-scoped public port (no secret) + narrow internal IUserAgentDeliveryTargetReader (with NyxApiKey, only for outbound delivery). Runtime port split is first-PR scope, not follow-up.

Symptom

In Lark bot, /agents can return agents that belong to other Lark users, not just the caller.

Repro (current code):

  1. User A talks to Lark bot in private chat → /daily ... creates agent → OwnerNyxUserId = A.
  2. User B talks to Lark bot in their own private chat → /agents.
  3. Result: B may see A's agents in the list.

This compounds with the per-id ops (/agent-status, /run-agent, /disable-agent, /enable-agent, /delete-agent), which never verify ownership before acting on agent_id. As long as B knows A's agent_id (e.g. from the leaked listing), B can run / disable / delete it.

Root cause (the immediate bug)

Evidence anchors:

  • agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs:497 — owner filter resolved from LLM-overridable arg or from a nullable /me resolver.
  • agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs:805QueryAllAsync() + permissive .Where(...) fall-through.
  • agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs:512 (representative of agent_status / run_agent / disable_agent / enable_agent / delete_agent) — per-id ops never verify ownership.
  • agents/Aevatar.GAgents.ChannelRuntime/IUserAgentCatalogQueryPort.cs:5 (GetAsync) and :9 (QueryAllAsync) — root contract problem: neither carries caller identity.

AgentBuilderTool.cs:805-807:

var entries = await queryPort.QueryAllAsync(ct);
return entries
    .Where(x => string.IsNullOrWhiteSpace(ownerFilter)
                || string.Equals(x.OwnerNyxUserId, ownerFilter, StringComparison.Ordinal))

Permissive on missing owner — when ownerFilter is null/"", every catalog row passes. ownerFilter becomes null whenever ResolveCurrentUserIdAsync (AgentBuilderTool.cs:1016) silently returns null: NyxID /me error envelope (token expired, NyxID 5xx, network blip), malformed JSON, or id / user_id / sub missing.

AgentBuilderTool.cs:497:

var ownerFilter = args.Str(\"owner_nyx_user_id\")
                  ?? await ResolveCurrentUserIdAsync(nyxClient, token, ct);

owner_nyx_user_id is read as a tool arg even though it is not in the JSON schema (lines 47-117) — residual prompt-injection foot-gun.

The same shape exists on the runtime variant: IUserAgentCatalogRuntimeQueryPort.cs:9 (QueryAllAsync) is consumed by AgentDeliveryTargetTool (LLM tool) and by FeishuCardHumanInteractionPort (internal outbound delivery, needs NyxApiKey). Today the LLM tool can transitively reach NyxApiKey through the same reader.

Why this bug class keeps being possible

  • The contract IUserAgentCatalogQueryPort carries no caller identity. Every tool author has to remember to (a) resolve current user, (b) filter by owner, (c) fail-closed on missing owner, (d) verify ownership before per-id ops. Four "remember tos" against a contract with no way to express the requirement.
  • The runtime variant additionally bundles NyxApiKey into the same reader that LLM tools consume — secret boundary lives in convention, not in types.
  • Patching line 807 is a one-shot fix; the same drift recurs on every new method added to the same port.

Proposed direction — unify caller scope across all surfaces

Same /agents command across cli / web / lark / telegram. Each surface returns the agents that belong to me in this surface. Keep the abstraction uniform; let surface differ only at the resolver edge. Generalizes ChannelUserConfigScope (#436) from "per-user config" to "per-user ownership".

A. OwnerScope as a proto sub-message (not a C# record)

Per CLAUDE.md "序列化(强制)" + "核心语义强类型" + "先定义 .proto 并生成类型": OwnerScope carries cross-actor / cross-readmodel semantics, so it must be proto-first.

message OwnerScope {
  string nyx_user_id = 1;            // required
  string platform = 2;               // canonical: \"nyxid\" for native (cli/web), \"lark\", \"telegram\", ...
                                      // empty string is *invalid*; resolver must produce a canonical value.
  string registration_scope_id = 3;  // required iff platform != \"nyxid\"
  string sender_id = 4;              // required iff platform != \"nyxid\" — bot end-user (per-sender, not per-conversation)
}
  • Embed OwnerScope atomically into InitializeSkillRunnerCommand, InitializeWorkflowAgentCommand, the actor State (SkillRunnerState / WorkflowAgentState), the committed events those actors publish, UserAgentCatalogDocument, UserAgentCatalogEntry. Replace the scattered OwnerNyxUserId / Platform / ScopeId fields. Do not extend the existing fragmented field set — collapse them into the sub-message in one step.
  • ScopeId (the existing field) means "workflow / skill scope" today and is overloaded; either rename existing usage to WorkflowScopeId or scope the rename to OwnerScope.registration_scope_id and leave the unrelated ScopeId alone. Do not let OwnerScope.registration_scope_id collide with the existing ScopeId semantically.

B. Canonical Platform value, not nullable

Protobuf string defaults to empty; persisted records have no real null. Don't let the design depend on C# string? surviving the proto / readmodel boundary.

  • Define a closed set: \"nyxid\" (cli + web native) | \"lark\" | \"telegram\" | … (additions reviewed).
  • Platform = \"\" is rejected at resolver output and at command-handler ingress (validation error, not silent default).
  • Filter at the readmodel uses the canonical string equally — there is no "null matches null" branch to get wrong.

C. Caller-scoped query port — Get + Query + Version

public interface IUserAgentCatalogQueryPort
{
    Task<UserAgentCatalogEntry?> GetForCallerAsync(string agentId, OwnerScope caller, CancellationToken ct);
    Task<IReadOnlyList<UserAgentCatalogEntry>> QueryByCallerAsync(OwnerScope caller, CancellationToken ct);
    Task<long?> GetStateVersionForCallerAsync(string agentId, OwnerScope caller, CancellationToken ct);
}
  • GetStateVersionAsync(agentId) is removed from the public scoped port. Returning a version (or null) for a non-owned agent_id discloses existence and version-progression — a soft existence oracle. The version reader is either (a) caller-scoped here, returning null for non-owned IDs, or (b) moved to a narrow IUserAgentLifecycleVersionReader registered only for the in-process WaitForCreatedAgentAsync / WaitForTombstoneReflectedAsync paths and never injected into LLM tools. Pick one — the issue's open question 5 below.
  • GetForCallerAsync returns null for both "doesn't exist" and "not yours" — single semantic, no existence disclosure to non-owners.
  • QueryAllAsync() is deleted from the public port. Projector internals get a separate reader, narrowly registered, not visible to tools (see §D for the secret-bearing variant).
  • Owner predicate is pushed into IProjectionDocumentReader via ProjectionDocumentFilter strict-equality on the full OwnerScope tuple. No .Where(...) in application code.

D. Split the runtime port: public (no secret) + internal delivery reader (with secret)

IUserAgentCatalogRuntimeQueryPort today is dual-purpose: serves both AgentDeliveryTargetTool (LLM-facing) and FeishuCardHumanInteractionPort (internal outbound delivery — FeishuCardHumanInteractionPort.cs:418 consumes target.NyxApiKey). Two consumers, two trust levels, one type — that is the leak surface.

Split:

  1. IUserAgentCatalogQueryPort (public, caller-scoped) — what LLM tools see. Returns UserAgentCatalogEntry without NyxApiKey. The DTO drops the field; NyxApiKey is no longer in the public surface at all.
  2. IUserAgentDeliveryTargetReader (internal, by id, with secret) — narrow contract, by agent_id (or delivery-target-id), returns the credential-bearing record. Registered only for the outbound delivery path (FeishuCardHumanInteractionPort and analogous Telegram/web outbound implementations). Not injected into any LLM tool, not exposed via DI to the agent-builder/delivery-target tools. Architecture guard (full-scan): no type implementing IAgentTool may have IUserAgentDeliveryTargetReader as a constructor dependency.
  3. AgentDeliveryTargetTool migrates to the caller-scoped public port — same GetForCallerAsync / QueryByCallerAsync shape, no NyxApiKey in payloads (current implementation already masks via MaskSecret but the field is still in the shared UserAgentCatalogEntry DTO; remove it from the public DTO entirely).

This makes the secret boundary a type boundary, not a convention.

E. Agent ownership lives in actor state; projection materializes it

Per CLAUDE.md "权威状态 / Actor 即业务实体": OwnerScope is part of the agent's authoritative state on the owning actor (SkillRunnerGAgent / WorkflowAgentGAgent), set at Initialize*Command time, immutable thereafter. The committed event the actor publishes carries OwnerScope. UserAgentCatalogProjector writes the same tuple into the readmodel for queries. The catalog is a replica; ownership is not invented at projection time.

Aligns with #444 (UserAgentCatalogGAgent set-membership) — projector consumes committed events, doesn't mint state.

F. AgentBuilderTool — decompose, don't file-move

The current tool is heavily channel-coupled: AgentToolRequestContext, ChannelMetadataKeys, Lark receive-target resolution, private-chat restriction (AgentBuilderTool.cs:162-167), card flow, Nyx relay context. Moving the file alone pollutes a "surface-neutral" project with channel semantics.

Decompose into three roles:

  1. AgentBuilderApplicationService (surface-neutral) — create / list / status / lifecycle. Depends on:
    • ICallerScopeResolver (resolves OwnerScope),
    • IOutboundDeliveryTargetResolver (resolves where the agent's outputs go: Lark (receive_id, type), Telegram chat id, native sink, …),
    • IUserAgentCatalogQueryPort (caller-scoped public port, no secret),
    • IActorDispatchPort for Initialize*Command / lifecycle commands.
      No reference to AgentToolRequestContext, ChannelMetadataKeys, or any Lark/Telegram type.
  2. Channel adapters (ChannelRuntime stays the home of):
    • LarkChannelCallerScopeResolver — reads AgentToolRequestContext + ChannelMetadataKeys.SenderId + registration scope.
    • LarkOutboundDeliveryTargetResolver — current ResolveDeliveryTarget logic.
    • Card flow / slash-command dispatch (AgentBuilderCardFlow) stays here.
    • Private-chat restriction is enforced at the adapter (it's a Lark UX rule, not a domain rule).
  3. LLM tool wrapper — thin IAgentTool shell that calls the application service. Lives wherever its surface lives (Lark adapter for Lark, cli adapter for cli, etc.).

Net: the application service code doesn't know what surface it's on; the adapters know exactly one surface; the readmodel is shared.

G. Drop owner_nyx_user_id arg

Remove the LLM-overridable owner-id parameter entirely. The tool contract is "operate on my agents". Override surface = impersonation surface.

Surface table

Surface resolver output "my agents" semantics
aevatar-cli (NyxID login) OwnerScope { A, \"nyxid\", \"\", \"\" } A's NyxID-native agents
Web console same same
Lark bot, private chat { A, \"lark\", bot-1, lark-A } A's agents created in this lark bot
Lark group, A says /agents same A's only — B's agents not visible
Telegram analogous analogous

Strict full-tuple equality on read and on creation-time write. No fall-through.

Architectural alignment with CLAUDE.md

  • 读写分离 / 查询走 readmodel: caller-scoped query stays on projection.
  • 禁止侧读冒充 query: QueryAllAsync from the application path is the side-read pattern; remove it.
  • 设计完备性: "any 'missing → permissive' strategy must instead fail-closed and define ownership."
  • API 字段单一语义: OwnerScope is the typed sub-message, not scattered fields and not a Metadata bag.
  • 序列化(强制): OwnerScope is proto-first; embedded into commands / state / events / readmodel docs.
  • 抽象一旦能被滥用即设计未完成: QueryAllAsync() and the dual-purpose runtime port are exactly that — the architecture removes both foot-guns, not just patches the call sites.
  • 权威状态: ownership lives on the owning actor's state; projector materializes.
  • Consistent with bug(daily): GitHub username binding shared across all Lark users of one bot — last writer wins #436 (ChannelUserConfigScope = (RegistrationScopeId, Platform, SenderId)): OwnerScope extends to (NyxUserId, Platform, RegistrationScopeId, SenderId) and applies to ownership instead of just config.

Implementation scope (first PR; no follow-ups)

  1. Proto contracts: define OwnerScope message; embed into InitializeSkillRunnerCommand, InitializeWorkflowAgentCommand, the relevant actor state proto, the committed events, UserAgentCatalogDocument, UserAgentCatalogNyxCredentialDocument (if used for filtering). Remove the now-redundant scattered fields. Generate types.
  2. OwnerScope resolver layer: ICallerScopeResolver interface (returns OwnerScope, throws CallerScopeUnavailableException on failure — never returns "anonymous" / "everyone"). Implementations: LarkChannelCallerScopeResolver, TelegramChannelCallerScopeResolver, NyxIdNativeCallerScopeResolver.
  3. Public catalog query port: drop QueryAllAsync and unscoped GetAsync / GetStateVersionAsync from IUserAgentCatalogQueryPort; add GetForCallerAsync / QueryByCallerAsync / GetStateVersionForCallerAsync (or move the version reader to the internal lifecycle reader — open question 5). Push owner predicate to IProjectionDocumentReader filters as full-tuple Eq. Public DTO UserAgentCatalogEntry no longer carries NyxApiKey.
  4. Runtime port split: replace IUserAgentCatalogRuntimeQueryPort with:
    • the same caller-scoped public port (AgentDeliveryTargetTool migrates to it, drops owner_nyx_user_id arg),
    • new IUserAgentDeliveryTargetReader (internal, by id, returns the credential-bearing record). Registered only for outbound delivery (FeishuCardHumanInteractionPort, analogous Telegram outbound). Not visible to LLM tools.
    • architecture guard (full-scan): no IAgentTool impl may take IUserAgentDeliveryTargetReader as a ctor dep.
  5. AgentBuilderTool decomposition:
    • AgentBuilderApplicationService (surface-neutral, depends on caller-scope + delivery-target resolvers + caller-scoped catalog port + dispatch port).
    • IOutboundDeliveryTargetResolver interface; Lark/Telegram impls in the channel adapters.
    • LLM tool wrapper (AgentBuilderTool) becomes a thin IAgentTool over the application service.
    • Card flow / private-chat restriction stays in ChannelRuntime (Lark-specific UX).
    • Drop owner_nyx_user_id from ParametersSchema / arg parsing.
  6. Per-id ops: rewrite agent_status / run_agent / disable_agent / enable_agent / delete_agent (and RequireManagedAgentAsync) to use GetForCallerAsync. Non-owned agent_id → "not found".
  7. Tests:
    • cross-NyxID isolation (same surface, two NyxID users),
    • cross-surface isolation (same NyxID user, cli vs lark),
    • cross-sender isolation in Lark group,
    • caller-resolver failure → fail-closed (not "show all"),
    • per-id ops on non-owned agent_id return "not found" (no existence disclosure, no version disclosure),
    • architecture guard (full-scan): no application-layer code calls QueryAllAsync on the agent-catalog reader; no IAgentTool impl depends on IUserAgentDeliveryTargetReader; UserAgentCatalogEntry does not carry NyxApiKey.

Open questions

  1. cli/web semantics: when an aevatar-cli user runs /agents, do they see only Platform=\"nyxid\" agents (symmetric surface), or all NyxID-owned agents including lark/telegram (admin view)?
    • Recommendation: symmetric surface. cli/web/lark/telegram fully equivalent. Cross-surface visibility becomes an explicit --all-surfaces flag, not the default.
  2. Per-sender vs per-conversation in lark groups: align with ChannelUserConfigScope (per-sender), per bug(daily): GitHub username binding shared across all Lark users of one bot — last writer wins #436. ✅
  3. Migration: existing UserAgentCatalogDocument rows have only OwnerNyxUserId + Platform + ScopeId, no SenderId. Pre-bug(daily): GitHub username binding shared across all Lark users of one bot — last writer wins #436 lark agents likewise. Choose between (a) lazy backfill on first read, (b) explicit reprovision, (c) deprecate-and-recreate.
    • Recommendation: (c) for the lark surface (small footprint, agent UX is recreate-cheap), (a) for Platform=\"nyxid\" (cli/web — fewer rows, can backfill from OwnerNyxUserId alone since the scope is (NyxUserId, \"nyxid\", \"\", \"\")).
  4. ScopeId field collision: existing UserAgentCatalogDocument.ScopeId is workflow/skill scope, not registration scope. Either rename existing field (preferred — proto-first refactor) or namespace OwnerScope.registration_scope_id differently in the readmodel.
  5. State-version reader placement: keep on the caller-scoped public port (GetStateVersionForCallerAsync) or move to a narrow internal IUserAgentLifecycleVersionReader for the in-process WaitFor* paths only?
    • Recommendation: caller-scoped on the public port. WaitFor* runs inside the application service which already has the caller's OwnerScope; passing it through is free. Internal-reader option is cleaner in theory but adds a third reader interface for one method. Reconsider if a future caller of WaitFor* doesn't have a OwnerScope (none today).

Acceptance

  • No application-layer code can call a method that returns un-scoped agent data — enforced by IUserAgentCatalogQueryPort shape and an architecture-guard (full-scan).
  • No IAgentTool impl has IUserAgentDeliveryTargetReader (the secret-bearing reader) as a constructor dependency — full-scan guard.
  • UserAgentCatalogEntry (public DTO) does not carry NyxApiKey — full-scan guard on the field name.
  • OwnerScope is a proto sub-message embedded in commands / state / events / readmodel; no scattered OwnerNyxUserId + Platform + RegistrationScopeId fields remain on the public surface.
  • Platform is a canonical non-empty string at the resolver output and at command-handler ingress; empty rejected.
  • /agents in lark bot returns only the caller's (NyxUserId, \"lark\", RegistrationScopeId, SenderId)-scoped agents.
  • /agents in cli/web returns only (NyxUserId, \"nyxid\", \"\", \"\")-scoped agents.
  • /agent-status / /run-agent / /disable-agent / /enable-agent / /delete-agent return "not found" for non-owned agent_id (no existence disclosure, no version disclosure).
  • State-version reader (in whichever placement) does not return null vs not-null differently for owned vs non-owned agent_id.
  • Caller-resolver failure (NyxID /me 5xx, expired token, etc.) fails closed with an actionable error — never falls through to "all agents".
  • AgentBuilderApplicationService has no reference to AgentToolRequestContext / ChannelMetadataKeys / Lark / Telegram types.
  • AgentDeliveryTargetTool migrates to the caller-scoped public port; owner_nyx_user_id arg removed; tool no longer transitively reaches NyxApiKey.
  • Tests: cross-NyxID, cross-surface, cross-sender, resolver-failure, secret-boundary, version-existence-non-disclosure.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions