ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing #12408
Is it ok to do this in a library? Applications or libraries using Arrow C++ may want to use their own context storage and will be surprised that Arrow overrides it. @lidavidm
Right, this should generally be left to the application. For us, I suppose that's technically each individual test; that may get tedious, though.
Maybe we should file an upstream issue? If we could access the upstream shared_ptr, we could keep references to it ourselves and be independent of what the application wants to do: https://github.com/open-telemetry/opentelemetry-cpp/blob/3a3bf25289901079534b1cabe14e9c4fb3b35968/api/include/opentelemetry/context/runtime_context.h#L154
A brief reproduction using a thread and TSAN might be enough to convince them.
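The idea of holding our own reference to the upstream `shared_ptr` can be sketched roughly as follows. This is only an illustration under stated assumptions: `ContextStorage` and `ReadAfterOwnerReset` are hypothetical names, and `owner` merely plays the role of the upstream pointer; none of this is OpenTelemetry's or Arrow's actual code.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for a library's process-wide context storage.
struct ContextStorage {
  int value = 42;
};

// The worker takes its own reference before the original owner lets go,
// so the storage stays valid for as long as the worker needs it.
int ReadAfterOwnerReset() {
  // `owner` plays the role of the upstream shared_ptr (assumption).
  auto owner = std::make_shared<ContextStorage>();
  std::shared_ptr<ContextStorage> held = owner;  // our own reference
  assert(held != nullptr);

  int observed = 0;
  std::thread worker([held = std::move(held), &observed] {
    observed = held->value;  // safe: the lambda owns a reference
  });

  owner.reset();  // the original owner goes away while the worker may still run
  worker.join();
  return observed;
}
```

Because the worker owns a reference, the order in which the original owner is destroyed no longer matters, which is the independence from the application that the comment is after.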
Ah, good points. I'll file an upstream issue and put this on hold for now.
Thanks a lot for filing this.
Looks like they implemented it indeed. Maybe we can bump the bundled OpenTelemetry, or wait for a new release.
Looks like they are releasing monthly. Let's just disable OT in our CI builds until the release is out and we can address this at that point.
cpp/src/arrow/util/thread_pool.cc
Outdated
This would probably deserve a more general approach. For example:
class Executor {
 public:
  class Resource {
   public:
    virtual ~Resource();
  };
  // All live executors should keep this object alive
  static void KeepAlive(std::shared_ptr<Resource>);
};
I switched to what I think you were intending. This basically inverts the dependency so that OT has to call keepalive on the thread pools instead of the other way around. The implementation's KeepAlive method was not static though. Can you double check that I addressed your concerns here?
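A minimal sketch of the inverted dependency described above, assuming a non-static `KeepAlive` on a much simplified executor. The `Executor` here is a hypothetical toy and not Arrow's `ThreadPool`; `Spawn`, `TracingStorage`, and `StorageOutlivesCaller` are illustrative names that do not appear in the PR.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Simplified, hypothetical executor; not Arrow's actual ThreadPool.
class Executor {
 public:
  class Resource {
   public:
    virtual ~Resource() = default;
  };

  // Keep `resource` alive until this executor's threads have terminated.
  void KeepAlive(std::shared_ptr<Resource> resource) {
    assert(resource != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    kept_alive_.push_back(std::move(resource));
  }

  // Run `task` on a new thread owned by this executor (illustrative only).
  void Spawn(std::function<void()> task) { threads_.emplace_back(std::move(task)); }

  ~Executor() {
    for (auto& t : threads_) t.join();
    // Members are destroyed after this body runs, so kept_alive_ is
    // released only once every thread has been joined.
  }

 private:
  std::mutex mutex_;
  std::vector<std::shared_ptr<Resource>> kept_alive_;
  std::vector<std::thread> threads_;
};

// Hypothetical resource standing in for the tracing library's storage.
struct TracingStorage : Executor::Resource {
  explicit TracingStorage(bool* destroyed) : destroyed_(destroyed) {}
  ~TracingStorage() override { *destroyed_ = true; }
  int value = 7;
  bool* destroyed_;
};

// Returns true if the worker saw live storage after the caller gave up
// ownership, and the storage was destroyed once the executor was gone.
bool StorageOutlivesCaller() {
  bool destroyed = false;
  bool seen_alive = false;
  {
    Executor executor;
    auto storage = std::make_shared<TracingStorage>(&destroyed);
    TracingStorage* raw = storage.get();
    executor.KeepAlive(std::move(storage));  // the executor is now the only owner
    executor.Spawn([raw, &seen_alive] { seen_alive = (raw->value == 7); });
  }  // the destructor joins the thread, then releases the resource
  return seen_alive && destroyed;
}
```

With this shape the tracing side registers its storage with each executor, and the executor needs no knowledge of what it is keeping alive.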
This is not a fix for the issue. The proper fix is at #12408 but cannot be merged as we are waiting for an OT release with an upstream fix. This is just to disable TSAN testing of OT in the meantime so we don't hide other bugs that may crop up while we wait for the proper fix. We should revert this change as part of #12408.
Closes #12491 from westonpace/bugfix/ARROW-15604--temporarily-disable-ot-in-ci
Authored-by: Weston Pace <[email protected]>
Signed-off-by: David Li <[email protected]>
I've made an attempt at addressing the PR feedback in preparation for the OT release. This should be pretty close to what we will use (I think only two lines will change once we can use their new API).
The unit test is a little unreliable. If I run it on repeat I can trigger the failure much more reliably than running it a single time. I played around with a few different approaches but couldn't come up with a variation that failed very reliably. I'm not sure if we want to leave it in or just get rid of it and rely on catching this via TSAN were there to be a regression.
3a02f28 to
effc3b3
Compare
We can keep the test, I think. Even if we plan to catch regressions via TSAN, presumably this will trigger it more reliably?
pitrou
left a comment
This looks good enough to me.
cpp/src/arrow/util/thread_pool.h
Outdated
Nit, sorry :-)
- /// \brief Keeps a resource alive until all executor threads have terminated
+ /// \brief Keep a resource alive until all executor threads have terminated
cpp/src/arrow/util/thread_pool.h
Outdated
Hmm... does this need ARROW_EXPORT?
Quickly bump the version since it changes a few APIs we'll use (most notably for #11920). #11963 will also need updating, but the conda-forge packages need to be updated first. This does not include the fix needed for #12408; that will require another version bump.
Closes #12516 from lidavidm/arrow-15789
Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
What is the status of this?
OpenTelemetry v1.3.0 just released with the necessary fix, so we should be able to update it and get this properly fixed now.
… This results in use-after-free. This commit binds the OT storage lifetime to our thread pool threads so that the storage will not be destroyed until all threads have ended.
…n-sdist-test build" This reverts commit d59dbbc.
…OpenTracing" This reverts commit 6aa3070.
… to grabbing OT's static context instead of resetting it with our own.
effc3b3 to
4e6a7e7
Compare
@github-actions crossbow submit test-fedora-35-cpp test-ubuntu-20.04-cpp-* test-r-ubuntu-22.04
Revision: f9c069d Submitted crossbow builds: ursacomputing/crossbow @ actions-1844
I'm pretty sure the TSAN failure is unrelated. Just to be sure I ran this via TSAN locally and it passes fine. I'm going to proceed with merge.
Benchmark runs are scheduled for baseline = 6240eae and contender = 681ede6. 681ede6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
The particular cause of the failure was:
It could be fixed with scoping:
However, I suspect this would quickly get tedious. Instead it seems we can control the lifetime of the OT runtime storage and bind it to our worker threads. In the future we may want to consider doing similar tricks to keep alive the memory pool, global thread pools, etc.
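A generic sketch of what scoping means in this situation, assuming the shared state is declared before the worker threads and the workers are joined inside an inner scope. `Storage` and `CountHitsWithScopedWorkers` are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical shared state that worker threads touch.
struct Storage {
  std::atomic<int> hits{0};
};

// The storage is declared first and the workers live in an inner scope,
// so every worker has finished before the storage can be destroyed.
int CountHitsWithScopedWorkers(int num_workers) {
  assert(num_workers >= 0);
  Storage storage;
  {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
      workers.emplace_back([&storage] { storage.hits.fetch_add(1); });
    }
    for (auto& w : workers) w.join();
  }  // no worker can touch `storage` past this point
  return storage.hits.load();
}
```

Every call site has to arrange its scopes this way, which is why tying the storage lifetime to the worker threads scales better.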