Skip to content

Fix wake_by_ref() weak memory issue #7632

@frostyplanet

Description

@frostyplanet

Version
tokio 1.47.1

Platform
github runner macos-15

rustc 1.89.0 (29483883e 2025-08-04)

Default host: aarch64-apple-darwin
rustup home:  /Users/runner/.rustup

installed toolchains
--------------------
stable-aarch64-apple-darwin (active, default)

active toolchain
----------------
name: stable-aarch64-apple-darwin
active because: it's the default toolchain
installed targets:
  aarch64-apple-darwin

Description

There is still dead lock after issue #7589, calling waker_by_ref from another thread, does not wake up the task. So I added more debug log and came up with a patch. (tracked by frostyplanet/crossfire-rs#39)

The test code
https://github.com/frostyplanet/crossfire-rs/blob/ae7a430416ea6488f698454c5ebdda050ba126c1/src/tests/test_async.rs#L857
case(mpmc::bounded_async::(1)

In this scenario, there are 2 senders (one async task and one thread), 2 receivers (one async task and one thread) .
hang at lease three times while running under a current-thread tokio runtime.

The debug log was added inside tokio:
https://github.com/frostyplanet/tokio/compare/5e509ca7dc75112f2a6e2840c364f07b2e7f691d..6a96ec96cecbef5ba37a82a196501ffa14f25021

Original log can be downloaded from https://github.com/frostyplanet/crossfire-rs/actions/runs/17850277588/job/50757354826
The last part of the log before hang.

[2025-09-19 06:45:00.469416][DEBUG][ThreadId(2)][async_rx.rs:230] rxSome(Id(2)): recv waker(8227)
[2025-09-19 06:45:00.469418][DEBUG][ThreadId(2)][waker_registry.rs:209] rxSome(Id(2)): reg waker(8228)
[2025-09-19 06:45:00.469419][DEBUG][ThreadId(2)][async_rx.rs:265] rxSome(Id(2)): commit_waiting waker(8228) 1
[2025-09-19 06:45:00.469419][DEBUG][ThreadId(2)][harness.rs:219] task 2 transition_to_idle Ok
[2025-09-19 06:45:00.469448][DEBUG][ThreadId(6)][waker_registry.rs:209] rxNone: reg waker(8229)
[2025-09-19 06:45:00.469450][DEBUG][ThreadId(6)][blocking_rx.rs:133] rx: waker(8229) commit_waiting state=1
[2025-09-19 06:45:00.469454][DEBUG][ThreadId(5)][state.rs:271] waker.wake_by_ref (idle)
[2025-09-19 06:45:00.469456][DEBUG][ThreadId(5)][mod.rs:661] task Id(2) outside push
[2025-09-19 06:45:00.469457][DEBUG][ThreadId(5)][waker_registry.rs:392] wake rx waker(8228) Waked
[2025-09-19 06:45:00.469460][DEBUG][ThreadId(5)][waker_registry.rs:52] txNone: reg waker(9253)
[2025-09-19 06:45:00.469460][DEBUG][ThreadId(5)][blocking_tx.rs:149] tx: sender_reg_and_try Some(waker(9253)) state=1

Thread 5 is a sender in blocking context, it tried to wake task Id(2) by waker(8228). We can see task Id(2) transit from idle to notified.

[2025-09-19 06:45:00.469454][DEBUG][ThreadId(5)][state.rs:271] waker.wake_by_ref (idle)
[2025-09-19 06:45:00.469456][DEBUG][ThreadId(5)][mod.rs:661] task Id(2) outside push
[2025-09-19 06:45:00.469457][DEBUG][ThreadId(5)][waker_registry.rs:392] wake rx waker(8228) Waked

And then pushed to schedule. My guess is that the worker thread will unpark, but saw nothing to run.

src/runtime/scheduler/current_thread/mod.rs

    fn schedule(&self, task: task::Notified<Self>) {                                                                        
        use scheduler::Context::CurrentThread;                                                                              
    
        context::with_scheduler(|maybe_cx| match maybe_cx {                                                                 
            Some(CurrentThread(cx)) if Arc::ptr_eq(self, &cx.handle) => {                                                   
                let mut core = cx.core.borrow_mut();
    
                // If `None`, the runtime is shutting down, so there is no need                                             
                // to schedule the task.                                                                                    
                if let Some(core) = core.as_mut() {
                    tracing::debug!("task {:?} push cx", task.task_id());
                    core.push_task(self, task);
                } else {
                    tracing::debug!("task {:?} push cx None", task.task_id());                                              
                }
            }   
            _ => {
                // Track that a task was scheduled from **outside** of the runtime.                                         
                self.shared.scheduler_metrics.inc_remote_schedule_count();                                                  
                tracing::debug!("task {:?} outside push", task.task_id());                                                  
                // Schedule the task                                                                                        
                self.shared.inject.push(task);
                self.driver.unpark();                                                                                       
            }
        }); 
    }

In src/runtime/scheduler/inject.rs, push() is protected with lock, but pop() uses is_empty() before lock.

    /// Pushes a value into the queue.
    ///
    /// This does nothing if the queue is closed.
    pub(crate) fn push(&self, task: task::Notified<T>) {
        let mut synced = self.synced.lock();
        // safety: passing correct `Synced`
        unsafe { self.shared.push(&mut synced, task) }
    }

    pub(crate) fn pop(&self) -> Option<task::Notified<T>> {
        if self.shared.is_empty() {
            return None;
        }

        let mut synced = self.synced.lock();
        // safety: passing correct `Synced`
        unsafe { self.shared.pop(&mut synced) }
    }

In src/runtime/scheduler/inject/shared.rs, len() is using Acquire, and push is using Release.
So I think the problem is here, because I had a similar piece of code which reported a deadlock by Miri.
frostyplanet/crossfire-rs@0e790a6

Although Miri emulates weak memory by delaying the effect on load() and store(),
and Arm is other-multi-copy arch, from which load and store effects are immediate.
My theory is that it's possible self.driver.unpark() to be re-order inside the mutex scope, wake up the worker before setting len, therefore pop() might not see the change in len.

So I have changed load() and store() to SeqCst, the tests passed.
The patch:
frostyplanet@2fb5aaf
Test results:
https://github.com/frostyplanet/crossfire-rs/actions/runs/17851772944
https://github.com/frostyplanet/crossfire-rs/actions/runs/17852889652
(NOTE: #7622 was opt-out from the dep in the workflow, in order to reproduce hang more frequently)

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-tokioArea: The main tokio crateC-bugCategory: This is a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions