Skip to content

fix(abr): 让 LLM 会话静态状态拥有进程生命周期以避免关机崩溃#694

Merged
qiin2333 merged 2 commits into
masterfrom
fix/abr-llm-worker-static-lifetime
May 29, 2026
Merged

fix(abr): 让 LLM 会话静态状态拥有进程生命周期以避免关机崩溃#694
qiin2333 merged 2 commits into
masterfrom
fix/abr-llm-worker-static-lifetime

Conversation

@qiin2333
Copy link
Copy Markdown
Collaborator

问题

ABR 的 LLM 决策跑在 detached 线程上(abr.cpp process_feedbackstd::thread(llm_worker, ...).detach()),而其 HTTP 调用的 CURLOPT_TIMEOUT 是 300 秒。因此进程开始关闭时,该 worker 可能仍在飞行(最长 5 分钟)。

llm_worker 在 curl 返回后会重新获取 sessions_mutex 并访问 sessions map。如果此时这两个文件作用域静态对象已被销毁(静态析构顺序),就是一次 use-after-free,可能导致进程退出时崩溃。

注:正常运行期的 use-after-free 已被 generation 校验防住(worker 重新加锁后校验 it->second.generation != generation);本问题仅在关机期撞上在途 LLM 调用时触发。

修复

sessions_mutexsessions 改为进程生命周期分配(故意不释放),这样即使有 worker 在关机后才返回,访问的也始终是有效内存。

  • 改动仅 2 行(外加注释),通过引用绑定,所有调用点不变
  • 一次性泄漏属可接受——这是进程级单例状态,标准做法。
  • 不阻塞关机(不需要 join 长达 5 分钟的在途线程)。

验证

ninja sunshine 编译通过,abr.cpp.obj 干净编译,sunshine.exe 正常链接。

…rash

The LLM worker runs on a detached thread and its HTTP call has a 300s
timeout, so it can still be in flight when the process starts shutting
down. On return it reacquires sessions_mutex and touches the sessions map;
if those file-scope statics were already destroyed, that is a
static-destruction-order use-after-free that can crash on exit.

Allocate sessions_mutex and sessions with process lifetime (intentionally
never freed) so a late-returning worker is always safe. All call sites are
unchanged (they bind to references).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 21e1fac2-57d1-4822-8572-7402a64a8b62

📥 Commits

Reviewing files that changed from the base of the PR and between 2a113e0 and 57a12fc.

📒 Files selected for processing (1)
  • src/abr.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/abr.cpp

Summary by CodeRabbit

Bug Fixes

  • 修复了应用关闭阶段的潜在稳定性问题,减少后台任务在进程终止时引发的崩溃风险,提升整体可靠性。
  • 缓解了全局日志在晚启动后台任务时可能出现的极窄竞态,改善相关错误可见性和日志稳定性。

Walkthrough

src/abr.cpp 中会话共享状态的同步机制进行了调整:将 sessions_mutexsessions 从函数/命名空间静态对象改为由 new 在进程生命周期内分配并永不释放;并在 llm_worker 的日志点处新增注释,说明与 BOOST_LOG 去初始化相关的潜在 use-after-free 场景。

Changes

会话状态生命周期管理

Layer / File(s) Summary
会话互斥锁和容器的动态分配
src/abr.cpp
sessions_mutexsessions 从函数/命名空间静态变量改为通过 new 动态分配并以引用持有且不释放,避免进程关闭阶段的静态析构顺序导致 detached LLM 工作线程访问已析构对象。
llm_worker 日志点注释
src/abr.cpp
llm_worker 中加锁后的 BOOST_LOG 周围新增注释,描述全局日志去初始化与晚到 worker 记录日志之间的潜在 use-after-free 场景及其当前的极窄竞态窗口。

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed 标题准确概括了主要改动——将LLM会话静态状态改为进程生命周期以避免关机时的use-after-free崩溃。
Description check ✅ Passed 描述详细说明了问题根源、修复方案和验证步骤,与changeset完全相关,清晰表达了改动的意图。
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/abr-llm-worker-static-lifetime

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/abr.cpp`:
- Around line 38-44: The current fix of allocating sessions_mutex and sessions
via *new prevents their destruction but does not prevent UAFs caused by detached
llm_worker threads calling BOOST_LOG (e.g., from parse_llm_response) during
static teardown when the global severity loggers
(verbose/debug/info/warning/error/fatal and tests) are already destroyed; either
(A) convert those global Boost loggers in src/logging.cpp to process-lifetime
objects (allocate them with *new or otherwise ensure they are not destroyed at
static teardown) so they remain valid for late workers, or (B) add a shutdown
atomic flag (e.g., a global atomic<bool> shutdown_in_progress) that llm_worker
and parse_llm_response check and skip all BOOST_LOG and other global access when
set; whichever approach you pick, ensure logging::deinit() is only called after
all workers have stopped and reference the symbols sessions_mutex, sessions,
parse_llm_response, llm_worker, BOOST_LOG, and logging::deinit in your changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: faaca70d-e1d1-453f-81cd-c3f189406ede

📥 Commits

Reviewing files that changed from the base of the PR and between 5707925 and 2a113e0.

📒 Files selected for processing (1)
  • src/abr.cpp
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Windows
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.{cpp,c,h}

⚙️ CodeRabbit configuration file

src/**/*.{cpp,c,h}: Sunshine 核心 C++ 源码,自托管游戏串流服务器。审查要点:内存安全、 线程安全、RAII 资源管理、安全漏洞。注意预处理宏控制的平台相关代码。

Files:

  • src/abr.cpp

Comment thread src/abr.cpp
Comment on lines +38 to +44
// The LLM worker runs on a detached thread and may still be in flight (its HTTP
// call has a 300s timeout) when the process begins shutting down. To avoid a
// static-destruction-order crash — where the worker reacquires the lock and
// touches these globals after they have been destroyed — both are allocated with
// process lifetime and intentionally never freed, so a late worker is always safe.
static std::mutex &sessions_mutex = *new std::mutex();
static std::unordered_map<std::string, session_state_t> &sessions = *new std::unordered_map<std::string, session_state_t>();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 检查 logging.h / logging.cpp 中 info/warning 等 severity logger 的定义与生命周期
fd -i 'logging.h' --exec sed -n '1,160p' {}
echo "----- logging.cpp -----"
fd -i 'logging.cpp' --exec sed -n '1,160p' {}
# 定位全局 logger 的定义/声明形式(命名空间作用域全局 vs 进程生命周期)
rg -nP 'severity_logger|BOOST_LOG_INLINE_GLOBAL_LOGGER|BOOST_LOG_GLOBAL_LOGGER|extern\s+.*logger' --iglob '*logging*'

Repository: AlkaidLab/foundation-sunshine

Length of output: 12903


src/abr.cpp:仅延长 sessions_mutex/sessions 生命周期不足,关机期仍可能因 BOOST_LOG 触发 UAF

  • 虽然通过 *new 泄漏让 sessions_mutex/sessions 获得进程生命周期,但同一 detached llm_worker 在关机过程中仍可能执行 BOOST_LOG(含 parse_llm_response 内部日志)。
  • BOOST_LOG 依赖的 severity logger 在 src/logging.cpp 中是命名空间作用域的全局对象(verbose/debug/info/warning/error/fatal,以及 SUNSHINE_TESTS 下的 tests),会在静态析构阶段被销毁;因此关机期仍可能访问到已析构的日志对象,UAF 风险未被完全消除。

建议把这些 logger 也改为进程生命周期(与当前 *new 方式一致),或在 worker 增加关机退出原子标志,关机后跳过所有 BOOST_LOG 与相关全局访问;同时确保 logging::deinit() 发生在所有相关 worker 停止之后。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/abr.cpp` around lines 38 - 44, The current fix of allocating
sessions_mutex and sessions via *new prevents their destruction but does not
prevent UAFs caused by detached llm_worker threads calling BOOST_LOG (e.g., from
parse_llm_response) during static teardown when the global severity loggers
(verbose/debug/info/warning/error/fatal and tests) are already destroyed; either
(A) convert those global Boost loggers in src/logging.cpp to process-lifetime
objects (allocate them with *new or otherwise ensure they are not destroyed at
static teardown) so they remain valid for late workers, or (B) add a shutdown
atomic flag (e.g., a global atomic<bool> shutdown_in_progress) that llm_worker
and parse_llm_response check and skip all BOOST_LOG and other global access when
set; whichever approach you pick, ensure logging::deinit() is only called after
all workers have stopped and reference the symbols sessions_mutex, sessions,
parse_llm_response, llm_worker, BOOST_LOG, and logging::deinit in your changes.

Add comments clarifying two intentional design limitations surfaced in
review:
- sessions are keyed by client_name; concurrent same-name sessions would
  collide. Acceptable: single-session is immune and paired devices have
  distinct names; a full fix needs a coordinated session-id rekey across
  resolve_client, this map, and change_dynamic_param_for_client.
- the detached worker's BOOST_LOG calls touch globals destroyed at static
  teardown, but the session-existence/generation guard makes the worker
  return before logging on shutdown; the residual race is acceptable.
@qiin2333 qiin2333 merged commit 3a2aa5b into master May 29, 2026
4 checks passed
@qiin2333 qiin2333 deleted the fix/abr-llm-worker-static-lifetime branch May 29, 2026 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant