
Conversation

wenyongh
Contributor

Use the shared memory's shared_mem_lock to lock the whole atomic.wait and
atomic.notify processes, and use it for os_cond_reltimedwait and os_cond_notify,
so that each whole process becomes an actual atomic operation:
the original implementation accesses the wait address under shared_mem_lock
but uses wait_node->wait_lock for os_cond_reltimedwait, which is not an atomic
operation.

Also remove the now-unnecessary wait_map_lock and wait_lock, since the whole
processes are already protected by shared_mem_lock.
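
To illustrate the idea, here is a simplified sketch of the single-lock pattern using plain pthreads. This is not the actual WAMR code; the names shared_mem_lock, wait_cond and the two functions are hypothetical stand-ins.

    #include <pthread.h>
    #include <stdint.h>

    /* Sketch: the check of the wait address and the condition wait happen
     * under the same mutex, so a notify cannot slip in between them. */
    static pthread_mutex_t shared_mem_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER;

    /* atomic.wait: sleep only if *addr still equals `expected` */
    static int
    atomic_wait_sketch(volatile uint32_t *addr, uint32_t expected)
    {
        pthread_mutex_lock(&shared_mem_lock);
        if (*addr != expected) {
            /* value already changed: return "not equal" without sleeping */
            pthread_mutex_unlock(&shared_mem_lock);
            return 1;
        }
        /* pthread_cond_wait atomically releases the mutex while sleeping and
         * re-acquires it before returning, so no notification can be missed
         * between the check above and the wait */
        pthread_cond_wait(&wait_cond, &shared_mem_lock);
        pthread_mutex_unlock(&shared_mem_lock);
        return 0;
    }

    /* atomic.notify: waking under the same lock orders it after any check
     * currently in progress in atomic_wait_sketch */
    static void
    atomic_notify_sketch(void)
    {
        pthread_mutex_lock(&shared_mem_lock);
        pthread_cond_broadcast(&wait_cond);
        pthread_mutex_unlock(&shared_mem_lock);
    }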

@g0djan
Contributor

g0djan commented Mar 22, 2023

Thank you @wenyongh !

Is this PR about the hanging problem?
Because I just tried it and it still hangs exactly the same; but if I use wasi-libc with your fix, it doesn't hang anymore on this branch, though it doesn't hang on the main branch either.

Also, I opened an issue about it in wasi-libc with my investigation, but since you made the PRs I suppose you discovered the same things.

@wenyongh
Contributor Author

Thank you @wenyongh !

Is this PR about the hanging problem? Because I just tried it and it still hangs exactly the same; but if I use wasi-libc with your fix, it doesn't hang anymore on this branch, though it doesn't hang on the main branch either.

The root cause of the hang is that the opcodes generated for a_store are not atomic in interpreter mode. As you mentioned, I submitted a PR to fix it in wasi-libc.
This PR fixes a potential issue in WAMR's atomic.wait implementation, which in fact is not atomic: a) we use shared_memory_wait to read *address and check whether we need to wait, and then b) use wait_lock to perform os_cond_reltimedwait. If another thread changes *address between a) and b), the original thread will still run into the wait, which might be indefinite and hang. After the change, we use shared_memory_lock to lock the whole process (both a) and b)) to make it atomic.
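
For illustration, the old two-lock ordering can be sketched in pseudo-C roughly as follows (the lock helpers and the argument list of os_cond_reltimedwait here are stand-ins, not the real WAMR symbols):

    /* old scheme, simplified: check and wait are guarded by different locks */
    shared_mem_lock();
    int need_wait = (*(uint32 *)address == expected);   /* a) check */
    shared_mem_unlock();
    /* <-- race window: another thread can store to *address and run
     *     atomic.notify here, and that notification is lost */
    if (need_wait) {
        wait_lock();
        os_cond_reltimedwait(/* cond, wait_lock, timeout */);  /* b) may block forever */
        wait_unlock();
    }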

Also, I opened an issue about it in wasi-libc with my investigation, but since you made the PRs I suppose you discovered the same things.

The data races seem to be expected behavior; the a_swap operation in wasi-libc (musl) looks like this:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/internal/atomic.h#L106-L115

static inline int a_swap(volatile int *p, int v)
{
	int old;
	do old = *p;      /* plain, non-atomic read of *p */
	while (a_cas(p, old, v) != old);
	return old;
}

Here old = *p is a plain (non-atomic) load, while a_cas(p, old, v) is atomic; if two threads operate on the same address p, the sanitizer may report data races.
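
For comparison only (an illustration, not a proposed wasi-libc change), the same swap written with C11 atomics makes the initial read an explicit atomic load, so a sanitizer would not see a plain racy access:

    #include <stdatomic.h>

    /* C11 sketch of the same CAS loop; the initial read is an explicit
     * (relaxed) atomic load rather than a plain load */
    static int
    a_swap_c11(_Atomic int *p, int v)
    {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        /* on failure, atomic_compare_exchange_weak refreshes `old` with the
         * current value of *p, so the loop simply retries */
        while (!atomic_compare_exchange_weak(p, &old, v))
            ;
        return old;
    }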

@g0djan
Contributor

g0djan commented Mar 22, 2023

If another thread changes *address between a) and b), the original thread will still run into the wait, which might be indefinite and hang.

Gotcha, makes sense. I will run the tests under TSAN and multiple times, as I did before to find the hanging situations.

Here old = *p is a plain (non-atomic) load, while a_cas(p, old, v) is atomic; if two threads operate on the same address p, the sanitizer may report data races.

Hm, doesn't that need to be fixed then? What's the benefit of keeping it that way?

@wenyongh wenyongh merged commit 49d439a into bytecodealliance:main Mar 23, 2023
@g0djan
Contributor

g0djan commented Mar 23, 2023

Multiple runs and running under TSAN passed 👍

@wenyongh wenyongh deleted the fix_interp_atomic_opcodes branch March 24, 2023 06:06
vickiegpt pushed a commit to vickiegpt/wamr-aot-gc-checkpoint-restore that referenced this pull request May 27, 2024
…#2044)
