Skip to content

bugfix: fix instance manager dead lock and RPC holding locks for a long time.#50

Merged
magicheng0816 merged 1 commit into
jd-opensource:mainfrom
magicheng0816:fix_instance_lock
Mar 25, 2026
Merged

bugfix: fix instance manager dead lock and RPC holding locks for a long time.#50
magicheng0816 merged 1 commit into
jd-opensource:mainfrom
magicheng0816:fix_instance_lock

Conversation

@magicheng0816
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request refactors the InstanceMgr class to improve concurrency and prevent deadlocks by introducing a new two-level locking scheme (cluster_mutex_ and metrics_mutex_) and ensuring RPC calls are made outside of critical sections. This involved consolidating multiple mutexes, modifying init, upload_load_metrics, reconcile_instance_states, update_request_metrics, select_instance_pair_on_slo, register_instance, and deregister_instance to adhere to the new locking strategy and separate lock-holding from external operations. A critical issue was identified in the refactored deregister_instance function, where a race condition could lead to data corruption if an instance is concurrently deregistered and re-registered, as the function fails to re-verify the instance's incarnation ID after re-acquiring the cluster_mutex_.

Comment thread xllm_service/scheduler/managers/instance_mgr.cpp
Copy link
Copy Markdown
Collaborator

@yq33victor yq33victor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@magicheng0816 magicheng0816 merged commit 495356b into jd-opensource:main Mar 25, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants