Skip to content

Conversation

@mapleFU
Copy link
Member

@mapleFU mapleFU commented Mar 27, 2025

Rationale for this change

The lambda continuation in ReadaheadGenerator::AddMarkFinishedContinuation might have a bug in below ordering:

Thread1: check state_->finished.load() == true in 1
Thread2: state->MarkFinishedIfDone, state->final_future.MarkFinished
Thread1: state_->num_running.fetch_add(1) in 2
Thread1: state->final_future.MarkFinished

Which thread1 is a thread calling ReadaheadGenerator::operator(), thread2 is triggered by underlying async generator.

This is because finished and num_running are different atomic variables.

What changes are included in this PR?

using lock instead

Are these changes tested?

Covered by existing

Are there any user-facing changes?

Bugfix

@github-actions
Copy link

⚠️ GitHub issue #45953 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #45953 has no components, please add labels for components.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2025

sidenote: a std::atomic<int64_t> num_running{0}; with atomic handling finished and num_running also fixes this case, however, we use lock to prove it's right firstly

@EnricoMi
Copy link
Contributor

@mapleFU I can confirm this fixes the Future issue!

Only the TwoLevelCacheWithExpirationTest.CleanupPeriodOk occurred across 100 test runs:
https://github.com/EnricoMi/arrow/actions/runs/14102683222/job/39502158145#step:10:972
https://github.com/EnricoMi/arrow/actions/runs/14102683222/job/39502158191#step:12:336

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2025

@pitrou @zanmato1984 This seems fixed the bug, however, lint disables <mutex> in async_generator.h. Why is this rule here? Should I use atomic and cas or using other way to bypass this problem?

@pitrou
Copy link
Member

pitrou commented Mar 27, 2025

Why is this rule here? Should I use atomic and cas or using other way to bypass this problem?

You can use arrow/util/mutex.h

@pitrou
Copy link
Member

pitrou commented Mar 27, 2025

FWIW, it is because of C++/CLI: #45810

@zanmato1984
Copy link
Contributor

@pitrou @zanmato1984 This seems fixed the bug, however, lint disables <mutex> in async_generator.h. Why is this rule here? Should I use atomic and cas or using other way to bypass this problem?

A related discussion about the lint rule and the deprecation of support for C++/CLI can be found here #45674 (comment).

And it turns out that we are good to get rid of it.

@zanmato1984
Copy link
Contributor

Why is this rule here? Should I use atomic and cas or using other way to bypass this problem?

You can use arrow/util/mutex.h

Right, this works too.

@mapleFU mapleFU marked this pull request as ready for review March 27, 2025 14:19
@github-actions
Copy link

⚠️ GitHub issue #45953 has no components, please add labels for components.

@mapleFU mapleFU changed the title GH-45953: [C++] DNM: Trying to use lock to fix bug in Generator GH-45953: [C++] Use lock to fix atomic bug in ReadaheadGenerator Mar 27, 2025
@github-actions
Copy link

⚠️ GitHub issue #45953 has no components, please add labels for components.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2025

@pitrou @zanmato1984 would you mind take a look?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, two minor comments

std::atomic<bool> finished{false};
int num_running{0}; // GUARDED_BY(mu)
bool finished{false}; // GUARDED_BY(mu)
arrow::util::Mutex mu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call this mutex.

Future<> final_future = Future<>::Make();
std::atomic<int> num_running{0};
std::atomic<bool> finished{false};
int num_running{0}; // GUARDED_BY(mu)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is GUARDED_BY an annotation for some kind of static analysis tool? Which one?

Copy link
Member Author

@mapleFU mapleFU Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just my code style, also related tool is https://clang.llvm.org/docs/ThreadSafetyAnalysis.html , I just follow same style here

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 27, 2025
@pitrou
Copy link
Member

pitrou commented Mar 27, 2025

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: 740b1cf

Submitted crossbow builds: ursacomputing/crossbow @ actions-c21700c42d

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice fix.

LGTM with one minor suggestion.

@mapleFU mapleFU requested a review from pitrou March 31, 2025 08:45
Comment on lines 802 to 804
if (!is_finished) {
++state_->num_running;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: add a comment to explain this

Suggested change
if (!is_finished) {
++state_->num_running;
}
if (!is_finished) {
// We're going to push to the queue below, but we need
// to update `num_running` while we're holding the lock.
++state_->num_running;
}

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion, otherwise LGTM

@mapleFU
Copy link
Member Author

mapleFU commented Mar 31, 2025

Comment fixed, I'll merge this after all ci passes

@mapleFU mapleFU merged commit 19033bc into apache:main Mar 31, 2025
35 of 36 checks passed
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Mar 31, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 19033bc.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@mapleFU mapleFU deleted the GH-45953 branch April 10, 2025 15:16
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025
apache#45954)

### Rationale for this change

The lambda continuation in `ReadaheadGenerator::AddMarkFinishedContinuation` might have a bug in below ordering: 

```
Thread1: check state_->finished.load() == true in 1
Thread2: state->MarkFinishedIfDone, state->final_future.MarkFinished
Thread1: state_->num_running.fetch_add(1) in 2
Thread1: state->final_future.MarkFinished
```

Which thread1 is a thread calling `ReadaheadGenerator::operator()`, thread2 is triggered by underlying async generator.

This is because `finished` and `num_running` are different atomic variables.

### What changes are included in this PR?

using lock instead

### Are these changes tested?

Covered by existing

### Are there any user-facing changes?

Bugfix

* GitHub Issue: apache#45953

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants