refactor signal infrastructure and support SIGINT (ctrl-c)#690
refactor signal infrastructure and support SIGINT (ctrl-c)#690
Conversation
wdcui
left a comment
There was a problem hiding this comment.
Overall, the code change looks good to me. I left some comments below.
| /// | ||
| /// The default implementation simply calls `f()` with no additional setup. | ||
| /// Platforms that require explicit thread registration should override this. | ||
| fn run_test_thread<R>(f: impl FnOnce() -> R) -> R { |
There was a problem hiding this comment.
Should we scope this function to tests only?
There was a problem hiding this comment.
I thought about it. Using cfg(test) does not work here because the tests reside in a different crate (litebox_shim_linux). Test functions cannot be used by a different crate. One possible solution is adding a new feature, but that adds complexity.
There was a problem hiding this comment.
Actually, we can use cfg(debug_assertions).
| thread.interrupt(); | ||
| } | ||
|
|
||
| fn run_test_thread<R>(f: impl FnOnce() -> R) -> R { |
There was a problem hiding this comment.
Added cfg(debug_assertions)
| /// Runs `f`, ensuring that [`CURRENT_THREAD_HANDLE`] is set while in the call to `f`. | ||
| fn run_with_handle<R>(tls: &TlsState, f: impl FnOnce() -> R) -> R { | ||
| // Safety: `tls_state` lives for the duration of this call. | ||
| unsafe { install_tls(tls) }; |
There was a problem hiding this comment.
All threads should install tls. Previously install_tls was called in the caller function, I moved it here because I want to reuse this function for test threads. So now threads including guest threads and test threads would install tls properly.
| .unwrap() | ||
| .retain(|h| !Arc::ptr_eq(&h.0, ¤t.0)); | ||
| *current.0.lock().unwrap() = None; | ||
| let tls_index = TLS_INDEX.load(Ordering::Relaxed); |
There was a problem hiding this comment.
Should we create a function to uninstall_tls?
| // Safety: the TLS pointer is valid as long as the thread is | ||
| // alive, and we hold the thread handle lock. | ||
| let tls = unsafe { &*inner.tls.0 }; | ||
| tls.pending_host_signals.fetch_or(bit, Ordering::SeqCst); |
There was a problem hiding this comment.
What's the effect of this fetch_or?
There was a problem hiding this comment.
Performs a bitwise “or” operation on the current value and the argument val, and sets the new value to the result.
Returns the previous value.
wdcui
left a comment
There was a problem hiding this comment.
LGTM. I left some comments below.
|
🤖 SemverChecks 🤖 No breaking API changes detected Note: this does not mean API is unchanged, or even that there are no breaking changes; simply, none of the detections triggered. |
| /// Signal numbers are 1-based and must be in the range 1–63. | ||
| #[derive(Copy, Clone, Debug, Eq, PartialEq)] | ||
| pub struct Signal(u32); |
| /// A set of [`Signal`]s, stored as a 64-bit bitmask. | ||
| /// | ||
| /// Bit `(signum - 1)` is set when signal `signum` is present in the set. | ||
| /// Because signal numbers are 1-based and capped at 63, all 63 possible | ||
| /// signals fit in a single `u64`. | ||
| #[derive(Clone, Debug, Eq, PartialEq)] | ||
| pub struct SigSet(u64); |
There was a problem hiding this comment.
If you are explicitly defining the internal details of how things are stored, you can just expose the value inside it via pub struct SigSet(pub u64) to remove from_u64/as_u64. Alternatively, you should not be specifying how it is being internally stored in the docs.
| /// Provider for consuming platform-originating signals. | ||
| /// | ||
| /// Platforms can record signals (e.g., `SIGINT`) and the shim should call | ||
| /// [`SignalProvider::take_pending_signals`] to consume them. | ||
| pub trait SignalProvider { | ||
| /// Atomically take all pending asynchronous signals (e.g., SIGINT and SIGALRM) | ||
| /// for the current thread, passing each one to `f`. | ||
| /// | ||
| /// Platforms that support asynchronous signals should override this method. | ||
| #[allow(unused_variables, reason = "no-op by default")] | ||
| fn take_pending_signals(&self, f: impl FnMut(crate::shim::Signal)) {} |
There was a problem hiding this comment.
I am quite surprised by this provider. There are some surprising inversions happening here, for example, it depends on crate::shim in its signature.
I don't think the Signal sits well in shim, since the shim module is meant for traits that the shim must implement but there isn't a trait being implemented there.
Also, "shim should call" implies that a shim that does not call stuff is incorrect. This sort of requirement is exactly what traits are for, and thus the way the design should be is to actually have the shim implement a trait that probably looks like the following:
pub trait ConsumeSignal {
fn consume_signal(Signal);
}
Then the platform can always invoke this trait, and the shim can make a decision of it wants to buffer things or not. If you need/want the "all in a group" behavior, then you can make the trait (instead of the above):
pub trait ConsumeSignal {
fn consume_signals(&[Signal]);
}
| /// for the current thread, passing each one to `f`. | ||
| /// | ||
| /// Platforms that support asynchronous signals should override this method. | ||
| #[allow(unused_variables, reason = "no-op by default")] |
There was a problem hiding this comment.
Nit: prefer expect rather than allow
| /// The passed `waker` should live at least until the thread leaves the interruptible | ||
| /// wait (i.e., [`on_interruptible_wait_end`](Self::on_interruptible_wait_end) is called). |
There was a problem hiding this comment.
This is a somewhat surprising requirement; we can simplify this by updating the signature in this to take a Waker rather than a &Waker, right?
| pub trait RawMutexProvider { | ||
| type RawMutex: RawMutex; | ||
|
|
||
| /// Called when a thread enters an interruptible wait. |
| /// | ||
| /// The passed `waker` should live at least until the thread leaves the interruptible | ||
| /// wait (i.e., [`on_interruptible_wait_end`](Self::on_interruptible_wait_end) is called). | ||
| /// The platform can use the `waker` to wake up the thread while it is in the interruptible wait. |
| /// | ||
| /// This is a no-op by default. | ||
| #[allow(unused_variables)] | ||
| fn on_interruptible_wait_start(&self, waker: &crate::event::wait::Waker<Self>) |
There was a problem hiding this comment.
The name of this function is somewhat weird. It is not actually beginning the wait, right? Maybe something like register_waker (and the other one should be unregister_waker)?
Also, what is the expected behavior if someone runs "start" multiple times in a row?
| /// Called when a thread leaves an interruptible wait. | ||
| /// | ||
| /// This is a no-op by default. | ||
| fn on_interruptible_wait_end(&self) {} |
There was a problem hiding this comment.
Can merge with the above one by taking in a Option<Waker> instead of a Waker; in that case, it is behaving more like update_interruptible_waker.
| impl TryFrom<Signal> for litebox::shim::Signal { | ||
| type Error = Signal; | ||
|
|
||
| fn try_from(value: Signal) -> Result<Self, Self::Error> { | ||
| match value { | ||
| Signal::SIGINT => Ok(Self::SIGINT), | ||
| Signal::SIGALRM => Ok(Self::SIGALRM), | ||
| _ => Err(value), |
There was a problem hiding this comment.
This is quite surprising, haven't you designed the litebox::shim::Signal literally to support the full set of Linux signals? You've even hard-coded the "up to 64" and so on, so I see no reason why this does not support SIGILL etc converted over.
I do think that the long-term move is for litebox::shim::Signal to not have this 64 restriction, but if you are assuming that right now, then wouldn't the correct move just be Signal::from_raw(...).ok_or(Err(...))?
This PR supports async signal SIGINT.
On Linux, the workflow is the following:
prepare_to_run_guest) --> check TLS and queue signals --> process signal3.1. before commit to wait,
check_for_interruptwould check TLS and queue signals --> process signal3.2. during wait, wake up and go to 3.1
3.3. after waiting, got to 3.1
For shim to be able to retrieve pending signals from platform, add a new
SignalProviderwith the interfacefn take_pending_signals(&self, mut f: impl FnMut(litebox::shim::Signal)).One tricky problem is that when a signal arrives, the running thread may be about to be blocked or blocked already (i.e., case 3.1 and 3.2), and we didn't interrupt a thread if it is running inside LiteBox code. In which case, we never have the chance to process that signal. It might be possible to examine the RIP of the thread in the signal handler and divert the execution to skip mutex wait call it if RIP is close to it (e.g., use assembly code to build a critical region), but this may not work on Windows (where we don't call syscall instruction directly). The current solution I have is based on the observation that for all blocking operations (e.g., futex wait, sleep) boil down to
RawMutex::block_or_timeoutwith the per-threadWaitStateand the expected valueThreadState::WAITING.WaitStatestores the thread status likeRUNNING_IN_HOST,WAITING,WOKEN, and etc. Thus, in the signal handler, besides recording the signal, we can update the thread state fromWAITINGtoWOKENand then callsfutex wake. If the thread is about to call futex wait, futex call would return immediately because of the change of the value. If the thread is being blocked, it would wake up and return. To accessWaitStatein the signal handler, I added two new interfaceson_interruptible_wait_startandon_interruptible_wait_endtoRawMutexProviderso that platform can useWakerto wake up the blocking thread.The signal handlers for SIGINT can be called on any threads. However, we don't want non-guest threads (e.g., background network worker, additional test thread spawned by
cargo test) to be called with the signal. For background network worker, we can block those signals. For test thread, the signal handler would check if it has our custom TLS. If not, it blocks those signals and raise them again.On Windows, the implementation is almost the same except that signal handler is called on a kernel thread (and thus cannot access litebox thread's TLS easily). I have to use a global variable to record all active threads.
I added some end-to-end C tests and Rust unit tests. To make Rust unit test work, I also introduced a new interface
fn run_test_thread<R>(f: impl FnOnce() -> R) -> RtoThreadProvider. This is because spawning a guest thread (viasys_clone) and a regular thread in test (via std::thread::spawn) have different routes and require different setup. For example, while running in test, there is no guest code, and thus we don't need to do fs/gs swap (hence no setup for one of fs/gs). To make the code work for both guest thread and non-guest thread, we can set gs == fs before running the test thread.