Version
tokio 1.47.1
Platform
GitHub Actions runner macos-15
rustc 1.89.0 (29483883e 2025-08-04)
Default host: aarch64-apple-darwin
rustup home: /Users/runner/.rustup
installed toolchains
--------------------
stable-aarch64-apple-darwin (active, default)
active toolchain
----------------
name: stable-aarch64-apple-darwin
active because: it's the default toolchain
installed targets:
aarch64-apple-darwin
Description
There is still a deadlock after issue #7589: calling wake_by_ref from another thread does not wake up the task. So I added more debug logging and came up with a patch. (Tracked by frostyplanet/crossfire-rs#39.)
The test code:
https://github.com/frostyplanet/crossfire-rs/blob/ae7a430416ea6488f698454c5ebdda050ba126c1/src/tests/test_async.rs#L857
The failing case is mpmc::bounded_async with capacity 1.
In this scenario there are 2 senders (one async task and one thread) and 2 receivers (one async task and one thread). It hung at least three times while running under a current-thread tokio runtime; a rough sketch of the topology is shown below.
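For illustration, here is a minimal, self-contained sketch of that topology. It uses the async-channel crate as a stand-in for crossfire's MPMC channel, purely to show the task/thread layout; the real test uses crossfire (see the link above):

use std::thread;

fn main() {
    // Bounded MPMC channel with capacity 1, as in the failing case.
    let (tx, rx) = async_channel::bounded::<usize>(1);
    let (tx2, rx2) = (tx.clone(), rx.clone());

    // One blocking sender and one blocking receiver on their own threads.
    let t_tx = thread::spawn(move || {
        for i in 0..10_000 {
            tx2.send_blocking(i).unwrap();
        }
    });
    let t_rx = thread::spawn(move || while rx2.recv_blocking().is_ok() {});

    // One async sender and one async receiver on a current-thread runtime.
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    rt.block_on(async move {
        let send = tokio::spawn(async move {
            for i in 0..10_000 {
                tx.send(i).await.unwrap();
            }
        });
        // The receiver exits once every sender has been dropped.
        let recv = tokio::spawn(async move { while rx.recv().await.is_ok() {} });
        send.await.unwrap();
        recv.await.unwrap();
    });

    t_tx.join().unwrap();
    t_rx.join().unwrap();
}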
The debug logging was added inside tokio:
https://github.com/frostyplanet/tokio/compare/5e509ca7dc75112f2a6e2840c364f07b2e7f691d..6a96ec96cecbef5ba37a82a196501ffa14f25021
The original log can be downloaded from https://github.com/frostyplanet/crossfire-rs/actions/runs/17850277588/job/50757354826
The last part of the log before the hang:
[2025-09-19 06:45:00.469416][DEBUG][ThreadId(2)][async_rx.rs:230] rxSome(Id(2)): recv waker(8227)
[2025-09-19 06:45:00.469418][DEBUG][ThreadId(2)][waker_registry.rs:209] rxSome(Id(2)): reg waker(8228)
[2025-09-19 06:45:00.469419][DEBUG][ThreadId(2)][async_rx.rs:265] rxSome(Id(2)): commit_waiting waker(8228) 1
[2025-09-19 06:45:00.469419][DEBUG][ThreadId(2)][harness.rs:219] task 2 transition_to_idle Ok
[2025-09-19 06:45:00.469448][DEBUG][ThreadId(6)][waker_registry.rs:209] rxNone: reg waker(8229)
[2025-09-19 06:45:00.469450][DEBUG][ThreadId(6)][blocking_rx.rs:133] rx: waker(8229) commit_waiting state=1
[2025-09-19 06:45:00.469454][DEBUG][ThreadId(5)][state.rs:271] waker.wake_by_ref (idle)
[2025-09-19 06:45:00.469456][DEBUG][ThreadId(5)][mod.rs:661] task Id(2) outside push
[2025-09-19 06:45:00.469457][DEBUG][ThreadId(5)][waker_registry.rs:392] wake rx waker(8228) Waked
[2025-09-19 06:45:00.469460][DEBUG][ThreadId(5)][waker_registry.rs:52] txNone: reg waker(9253)
[2025-09-19 06:45:00.469460][DEBUG][ThreadId(5)][blocking_tx.rs:149] tx: sender_reg_and_try Some(waker(9253)) state=1
Thread 5 is a sender in a blocking context; it tried to wake task Id(2) via waker(8228). We can see task Id(2) transition from idle to notified:
[2025-09-19 06:45:00.469454][DEBUG][ThreadId(5)][state.rs:271] waker.wake_by_ref (idle)
[2025-09-19 06:45:00.469456][DEBUG][ThreadId(5)][mod.rs:661] task Id(2) outside push
[2025-09-19 06:45:00.469457][DEBUG][ThreadId(5)][waker_registry.rs:392] wake rx waker(8228) Waked
The task was then pushed for scheduling. My guess is that the worker thread unparked but saw nothing to run.
src/runtime/scheduler/current_thread/mod.rs
fn schedule(&self, task: task::Notified<Self>) {
    use scheduler::Context::CurrentThread;

    context::with_scheduler(|maybe_cx| match maybe_cx {
        Some(CurrentThread(cx)) if Arc::ptr_eq(self, &cx.handle) => {
            let mut core = cx.core.borrow_mut();

            // If `None`, the runtime is shutting down, so there is no need
            // to schedule the task.
            if let Some(core) = core.as_mut() {
                tracing::debug!("task {:?} push cx", task.task_id());
                core.push_task(self, task);
            } else {
                tracing::debug!("task {:?} push cx None", task.task_id());
            }
        }
        _ => {
            // Track that a task was scheduled from **outside** of the runtime.
            self.shared.scheduler_metrics.inc_remote_schedule_count();
            tracing::debug!("task {:?} outside push", task.task_id());

            // Schedule the task
            self.shared.inject.push(task);
            self.driver.unpark();
        }
    });
}
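For context, the current-thread worker roughly alternates between running queued tasks and parking on the driver when both queues look empty. A simplified paraphrase (method names are illustrative, not tokio's exact code):

// Simplified paraphrase of the current-thread worker loop.
loop {
    // Try the local run queue first, then the shared inject queue.
    if let Some(task) = core.next_local_task().or_else(|| shared.inject.pop()) {
        task.run();
    } else {
        // If inject.pop()'s lock-free is_empty() fast path reads a stale 0,
        // the worker parks even though a task was just pushed, and the
        // unpark that accompanied that push has already been consumed.
        driver.park();
    }
}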
In src/runtime/scheduler/inject.rs, push() is protected by the lock, but pop() checks is_empty() before taking the lock.
/// Pushes a value into the queue.
///
/// This does nothing if the queue is closed.
pub(crate) fn push(&self, task: task::Notified<T>) {
    let mut synced = self.synced.lock();
    // safety: passing correct `Synced`
    unsafe { self.shared.push(&mut synced, task) }
}

pub(crate) fn pop(&self) -> Option<task::Notified<T>> {
    if self.shared.is_empty() {
        return None;
    }

    let mut synced = self.synced.lock();
    // safety: passing correct `Synced`
    unsafe { self.shared.pop(&mut synced) }
}
In src/runtime/scheduler/inject/shared.rs, len() loads the length with Acquire, while push() stores it with Release; roughly the shape shown below.
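A paraphrase of the relevant parts (not the exact tokio source):

pub(crate) fn is_empty(&self) -> bool {
    self.len() == 0
}

pub(crate) fn len(&self) -> usize {
    self.len.load(Ordering::Acquire)
}

// Called by `Inject::push()` with the lock held. This Release store is the
// only write that the lock-free fast path in `pop()` can observe.
pub(crate) unsafe fn push(&self, synced: &mut Synced, task: task::Notified<T>) {
    let len = self.len.unsync_load(); // ok: the lock is held
    // ... link the task into the queue ...
    self.len.store(len + 1, Ordering::Release);
}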
So I think the problem is here, because I had a similar piece of code for which Miri reported a deadlock:
frostyplanet/crossfire-rs@0e790a6
Miri emulates weak memory by delaying the effects of load() and store(), whereas Arm is an other-multi-copy-atomic architecture on which a store becomes visible to all other cores at the same time. My theory is that it is possible for the effect of self.driver.unpark() to be reordered into the mutex's critical section, waking the worker before len is updated, so pop() might not see the change in len.
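To make the suspected interleaving concrete, here is a self-contained sketch of the same push/pop shape (hypothetical code, not tokio's; on strongly ordered hardware it will most likely run to completion, it only shows where a stale read would bite):

use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Release}};
use std::sync::{Arc, Mutex};
use std::thread;

struct Inject {
    len: AtomicUsize,
    queue: Mutex<VecDeque<u32>>,
}

impl Inject {
    // Push under the lock; publish the new length with Release.
    fn push(&self, v: u32) {
        let mut q = self.queue.lock().unwrap();
        q.push_back(v);
        self.len.store(q.len(), Release);
    }

    // Lock-free fast path: an Acquire load of `len` before taking the lock.
    fn pop(&self) -> Option<u32> {
        if self.len.load(Acquire) == 0 {
            return None;
        }
        let mut q = self.queue.lock().unwrap();
        let v = q.pop_front();
        self.len.store(q.len(), Release);
        v
    }
}

fn main() {
    let inject = Arc::new(Inject {
        len: AtomicUsize::new(0),
        queue: Mutex::new(VecDeque::new()),
    });

    // Consumer: drain the queue, park when it looks empty
    // (mirrors the worker loop above).
    let consumer = {
        let inject = Arc::clone(&inject);
        thread::spawn(move || loop {
            match inject.pop() {
                Some(v) if v == u32::MAX => break, // sentinel: stop
                Some(_) => {}
                // If the Acquire fast path read a stale 0 here after the
                // producer's unpark already landed, this park never returns.
                None => thread::park(),
            }
        })
    };

    // Producer: push, then unpark (mirrors inject.push + driver.unpark).
    for i in 0..1_000_000 {
        inject.push(i);
        consumer.thread().unpark();
    }
    inject.push(u32::MAX);
    consumer.thread().unpark();
    consumer.join().unwrap();
}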
So I changed the load() and store() to SeqCst, and the tests passed.
The patch:
frostyplanet@2fb5aaf
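In terms of the Inject sketch above, the change amounts to the following (see the linked commit for the actual edit in inject/shared.rs):

// Push: publish the new length with SeqCst instead of Release.
self.len.store(q.len(), Ordering::SeqCst);

// Pop fast path: load with SeqCst instead of Acquire, so this load and the
// store above participate in the single total order that all SeqCst
// operations agree on.
if self.len.load(Ordering::SeqCst) == 0 {
    return None;
}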
Test results:
https://github.com/frostyplanet/crossfire-rs/actions/runs/17851772944
https://github.com/frostyplanet/crossfire-rs/actions/runs/17852889652
(NOTE: #7622 was opted out of the tokio dependency in the workflow, in order to reproduce the hang more frequently.)