fix(abr): 让 LLM 会话静态状态拥有进程生命周期以避免关机崩溃#694
Conversation
…rash The LLM worker runs on a detached thread and its HTTP call has a 300s timeout, so it can still be in flight when the process starts shutting down. On return it reacquires sessions_mutex and touches the sessions map; if those file-scope statics were already destroyed, that is a static-destruction-order use-after-free that can crash on exit. Allocate sessions_mutex and sessions with process lifetime (intentionally never freed) so a late-returning worker is always safe. All call sites are unchanged (they bind to references).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Summary by CodeRabbitBug Fixes
Walkthrough
Changes会话状态生命周期管理
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/abr.cpp`:
- Around line 38-44: The current fix of allocating sessions_mutex and sessions
via *new prevents their destruction but does not prevent UAFs caused by detached
llm_worker threads calling BOOST_LOG (e.g., from parse_llm_response) during
static teardown when the global severity loggers
(verbose/debug/info/warning/error/fatal and tests) are already destroyed; either
(A) convert those global Boost loggers in src/logging.cpp to process-lifetime
objects (allocate them with *new or otherwise ensure they are not destroyed at
static teardown) so they remain valid for late workers, or (B) add a shutdown
atomic flag (e.g., a global atomic<bool> shutdown_in_progress) that llm_worker
and parse_llm_response check and skip all BOOST_LOG and other global access when
set; whichever approach you pick, ensure logging::deinit() is only called after
all workers have stopped and reference the symbols sessions_mutex, sessions,
parse_llm_response, llm_worker, BOOST_LOG, and logging::deinit in your changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: faaca70d-e1d1-453f-81cd-c3f189406ede
📒 Files selected for processing (1)
src/abr.cpp
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Windows
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.{cpp,c,h}
⚙️ CodeRabbit configuration file
src/**/*.{cpp,c,h}: Sunshine 核心 C++ 源码,自托管游戏串流服务器。审查要点:内存安全、 线程安全、RAII 资源管理、安全漏洞。注意预处理宏控制的平台相关代码。
Files:
src/abr.cpp
| // The LLM worker runs on a detached thread and may still be in flight (its HTTP | ||
| // call has a 300s timeout) when the process begins shutting down. To avoid a | ||
| // static-destruction-order crash — where the worker reacquires the lock and | ||
| // touches these globals after they have been destroyed — both are allocated with | ||
| // process lifetime and intentionally never freed, so a late worker is always safe. | ||
| static std::mutex &sessions_mutex = *new std::mutex(); | ||
| static std::unordered_map<std::string, session_state_t> &sessions = *new std::unordered_map<std::string, session_state_t>(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 检查 logging.h / logging.cpp 中 info/warning 等 severity logger 的定义与生命周期
fd -i 'logging.h' --exec sed -n '1,160p' {}
echo "----- logging.cpp -----"
fd -i 'logging.cpp' --exec sed -n '1,160p' {}
# 定位全局 logger 的定义/声明形式(命名空间作用域全局 vs 进程生命周期)
rg -nP 'severity_logger|BOOST_LOG_INLINE_GLOBAL_LOGGER|BOOST_LOG_GLOBAL_LOGGER|extern\s+.*logger' --iglob '*logging*'Repository: AlkaidLab/foundation-sunshine
Length of output: 12903
src/abr.cpp:仅延长 sessions_mutex/sessions 生命周期不足,关机期仍可能因 BOOST_LOG 触发 UAF
- 虽然通过
*new泄漏让sessions_mutex/sessions获得进程生命周期,但同一 detachedllm_worker在关机过程中仍可能执行BOOST_LOG(含parse_llm_response内部日志)。 BOOST_LOG依赖的 severity logger 在src/logging.cpp中是命名空间作用域的全局对象(verbose/debug/info/warning/error/fatal,以及SUNSHINE_TESTS下的tests),会在静态析构阶段被销毁;因此关机期仍可能访问到已析构的日志对象,UAF 风险未被完全消除。
建议把这些 logger 也改为进程生命周期(与当前 *new 方式一致),或在 worker 增加关机退出原子标志,关机后跳过所有 BOOST_LOG 与相关全局访问;同时确保 logging::deinit() 发生在所有相关 worker 停止之后。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/abr.cpp` around lines 38 - 44, The current fix of allocating
sessions_mutex and sessions via *new prevents their destruction but does not
prevent UAFs caused by detached llm_worker threads calling BOOST_LOG (e.g., from
parse_llm_response) during static teardown when the global severity loggers
(verbose/debug/info/warning/error/fatal and tests) are already destroyed; either
(A) convert those global Boost loggers in src/logging.cpp to process-lifetime
objects (allocate them with *new or otherwise ensure they are not destroyed at
static teardown) so they remain valid for late workers, or (B) add a shutdown
atomic flag (e.g., a global atomic<bool> shutdown_in_progress) that llm_worker
and parse_llm_response check and skip all BOOST_LOG and other global access when
set; whichever approach you pick, ensure logging::deinit() is only called after
all workers have stopped and reference the symbols sessions_mutex, sessions,
parse_llm_response, llm_worker, BOOST_LOG, and logging::deinit in your changes.
Add comments clarifying two intentional design limitations surfaced in review: - sessions are keyed by client_name; concurrent same-name sessions would collide. Acceptable: single-session is immune and paired devices have distinct names; a full fix needs a coordinated session-id rekey across resolve_client, this map, and change_dynamic_param_for_client. - the detached worker's BOOST_LOG calls touch globals destroyed at static teardown, but the session-existence/generation guard makes the worker return before logging on shutdown; the residual race is acceptable.
问题
ABR 的 LLM 决策跑在 detached 线程上(abr.cpp
process_feedback里std::thread(llm_worker, ...).detach()),而其 HTTP 调用的CURLOPT_TIMEOUT是 300 秒。因此进程开始关闭时,该 worker 可能仍在飞行(最长 5 分钟)。llm_worker在 curl 返回后会重新获取sessions_mutex并访问sessionsmap。如果此时这两个文件作用域静态对象已被销毁(静态析构顺序),就是一次 use-after-free,可能导致进程退出时崩溃。修复
把
sessions_mutex和sessions改为进程生命周期分配(故意不释放),这样即使有 worker 在关机后才返回,访问的也始终是有效内存。验证
ninja sunshine编译通过,abr.cpp.obj干净编译,sunshine.exe正常链接。