bugfix: fix instance manager deadlock and RPC holding locks for a long time. #50
Code Review
This pull request refactors the InstanceMgr class to improve concurrency and prevent deadlocks. It consolidates multiple mutexes into a two-level locking scheme (cluster_mutex_ and metrics_mutex_) and moves RPC calls outside of critical sections. To follow the new scheme, init, upload_load_metrics, reconcile_instance_states, update_request_metrics, select_instance_pair_on_slo, register_instance, and deregister_instance were modified so that lock-holding is separated from external operations.

One critical issue remains in the refactored deregister_instance function: it does not re-verify the instance's incarnation ID after re-acquiring cluster_mutex_. If an instance is deregistered and re-registered concurrently, this race condition could lead to data corruption.
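The race described above, and one possible fix, can be illustrated with a minimal sketch. This is not the PR's actual code: InstanceMgrSketch, InstanceInfo, the notify_peers callback (standing in for the RPC), and contains are hypothetical names, and only cluster_mutex_ and the incarnation ID idea come from the review. The sketch snapshots the incarnation ID under the lock, performs the external call with no lock held, then re-acquires the lock and erases the entry only if the incarnation is unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the per-instance state.
struct InstanceInfo {
  uint64_t incarnation_id = 0;
};

// Hypothetical, simplified stand-in for InstanceMgr (assumption, not the PR's code).
class InstanceMgrSketch {
 public:
  // Registers or re-registers an instance; each call yields a fresh incarnation ID.
  uint64_t register_instance(const std::string& name) {
    std::lock_guard<std::mutex> lock(cluster_mutex_);
    const uint64_t id = ++next_incarnation_;
    instances_[name] = InstanceInfo{id};
    return id;
  }

  // Returns true only if the incarnation observed in phase 1 was erased.
  // notify_peers stands in for the RPC and runs with no lock held.
  bool deregister_instance(const std::string& name,
                           const std::function<void()>& notify_peers) {
    uint64_t seen = 0;
    {
      // Phase 1: snapshot the incarnation ID under the lock.
      std::lock_guard<std::mutex> lock(cluster_mutex_);
      auto it = instances_.find(name);
      if (it == instances_.end()) return false;
      seen = it->second.incarnation_id;
    }

    // Phase 2: slow external call outside the critical section.
    notify_peers();

    // Phase 3: re-acquire the lock and re-verify before mutating.
    std::lock_guard<std::mutex> lock(cluster_mutex_);
    auto it = instances_.find(name);
    if (it == instances_.end() || it->second.incarnation_id != seen) {
      return false;  // Entry vanished or was re-registered in the meantime.
    }
    instances_.erase(it);
    return true;
  }

  bool contains(const std::string& name) {
    std::lock_guard<std::mutex> lock(cluster_mutex_);
    return instances_.count(name) != 0;
  }

 private:
  std::mutex cluster_mutex_;
  std::unordered_map<std::string, InstanceInfo> instances_;
  uint64_t next_incarnation_ = 0;
};
```

In this sketch, a re-registration that lands during the unlocked window changes the incarnation ID, so the stale deregistration becomes a no-op instead of erasing the newer entry. Whether the real fix should skip, retry, or log in that case is a design choice for the PR author.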
No description provided.