ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing #12408

westonpace · 2022-02-12T03:11:30Z

The particular cause of the failure was:

  void InputReceived(ExecNode* input, ExecBatch batch) override {
    EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
    util::tracing::Span span;
    START_SPAN_WITH_PARENT(span, span_, "InputReceived",
                           {{"node.label", label()}, {"batch.length", batch.length}});

    DCHECK_EQ(input, inputs_[0]);

    bool did_push = producer_.Push(std::move(batch));
    if (!did_push) return;  // producer_ was Closed already

    if (input_counter_.Increment()) {
      // This will mark the exec plan finished
      Finish();
      // At this point the main thread is generally free to exit but span hasn't been destructed yet
    }
  }

It could be fixed with scoping:

  void InputReceived(ExecNode* input, ExecBatch batch) override {
    EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
    {
      util::tracing::Span span;
      START_SPAN_WITH_PARENT(span, span_, "InputReceived",
                             {{"node.label", label()}, {"batch.length", batch.length}});

      DCHECK_EQ(input, inputs_[0]);

      bool did_push = producer_.Push(std::move(batch));
      if (!did_push) return;  // producer_ was Closed already
    }
    if (input_counter_.Increment()) {
      // This will mark the exec plan finished
      Finish();
    }
  }

However, I suspect this would quickly get tedious. Instead it seems we can control the lifetime of the OT runtime storage and bind it to our worker threads. In the future we may want to consider doing similar tricks to keep alive the memory pool, global thread pools, etc.

github-actions · 2022-02-12T03:11:48Z

https://issues.apache.org/jira/browse/ARROW-15604

cpp/src/arrow/util/tracing_internal.cc

docker-compose.yml

pitrou · 2022-02-12T10:36:06Z

cpp/src/arrow/util/tracing_internal.cc

Is it ok to do this in a library? Applications or libraries using Arrow C++ may want to use their own context storage and will be surprised that Arrow overrides it. @lidavidm

Right, this should generally be left to the application. For us, I suppose that's technically each individual test; that may get tedious, though.

Maybe we should file an upstream issue? If we could access the upstream shared_ptr, we could keep references to it ourselves and be independent of what the application wants to do: https://github.com/open-telemetry/opentelemetry-cpp/blob/3a3bf25289901079534b1cabe14e9c4fb3b35968/api/include/opentelemetry/context/runtime_context.h#L154

A brief reproduction using a thread and TSAN might be enough to convince them.

Ah, good points. I'll file an upstream issue and put this on hold for now.

open-telemetry/opentelemetry-cpp#1211

Thanks a lot for filing this.

Looks like the implemented it indeed. Maybe we can bump the bundled OpenTelemetry, or wait for a new release.

Looks like they are releasing monthly. Let's just disable OT in our CI builds until the release is out and we can address this at that point.

pitrou · 2022-02-12T10:41:01Z

cpp/src/arrow/util/thread_pool.cc

This would probably deserve a more general approach. For example:

class Executor { public: class Resource { public: virtual ~Resource(); }; // All live executors should keep this object alive static void KeepAlive(std::shared_ptr<Resource>); };

I switched to what I think you were intending. This basically inverts the dependency so that OT has to call keepalive on the thread pools instead of the other way around. The implementation's KeepAlive method was not static though. Can you double check that I addressed your concerns here?

cpp/src/arrow/util/tracing_internal.h

lidavidm · 2022-02-12T17:46:49Z

cpp/src/arrow/util/tracing_internal.cc

Right, this should generally be left to the application. For us, I suppose that's technically each individual test; that may get tedious, though.

lidavidm · 2022-02-12T17:49:45Z

cpp/src/arrow/util/tracing_internal.cc

Maybe we should file an upstream issue? If we could access the upstream shared_ptr, we could keep references to it ourselves and be independent of what the application wants to do: https://github.com/open-telemetry/opentelemetry-cpp/blob/3a3bf25289901079534b1cabe14e9c4fb3b35968/api/include/opentelemetry/context/runtime_context.h#L154

A brief reproduction using a thread and TSAN might be enough to convince them.

This is not a fix for the issue. The proper fix is at #12408 but cannot be merged as we are waiting for an OT release with an upstream fix. This is just to disable TSAN testing of OT in the meantime so we don't hide other bugs that may crop up while we wait for the proper fix. We should revert this change as part of #12408 Closes #12491 from westonpace/bugfix/ARROW-15604--temporarily-disable-ot-in-ci Authored-by: Weston Pace <[email protected]> Signed-off-by: David Li <[email protected]>

westonpace · 2022-02-23T19:06:43Z

I've made an attempt at addressing the PR feedback in preparation for the OT release. This should be pretty close to what we will use (I think only two lines will change once we can use their new API).

The unit test is a little unreliable. If I run it on repeat I can trigger the failure much more reliably than running it a single time. I played around with a few different approaches but couldn't come up with a variation that failed very reliably. I'm not sure if we want to leave it in or just get rid of it and rely on catching this via TSAN were there to be a regression.

lidavidm · 2022-02-24T14:50:49Z

We can keep the test, I think. Even if we plan to catch regressions via TSAN, presumably this will trigger it more reliably?

pitrou

This looks good enough to me.

pitrou · 2022-02-24T15:15:37Z

cpp/src/arrow/util/thread_pool.h

Nit, sorry :-)

Suggested change

/// \brief Keeps a resource alive until all executor threads have terminated

/// \brief Keep a resource alive until all executor threads have terminated

pitrou · 2022-02-24T15:16:20Z

cpp/src/arrow/util/thread_pool.h

Hmm... does this need ARROW_EXPORT?

Quickly bump the version since it changes a few APIs we'll use (most notably for #11920). #11963 will also need updating, but the conda-forge packages need to be updated first. This does not include the fix needed for #12408, that will require another version bump. Closes #12516 from lidavidm/arrow-15789 Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>

pitrou · 2022-03-22T16:42:48Z

What is the status of this?

lidavidm · 2022-04-13T00:32:05Z

OpenTelemetry v1.3.0 just released with the necessary fix, so we should be able to update it and get this properly fixed now.

… This results in use-after-free. This commit binds the OT storage lifetime to our thread pool threads so that the storage will not be destroyed until all threads have ended.

…n-sdist-test build" This reverts commit d59dbbc.

…OpenTracing" This reverts commit 6aa3070.

… to grabbing OT's static context instead of resetting it with our own.

lidavidm · 2022-04-13T14:12:00Z

@github-actions crossbow submit test-fedora-35-cpp test-ubuntu-20.04-cpp-* test-r-ubuntu-22.04

github-actions · 2022-04-13T16:20:50Z

Revision: f9c069d

Submitted crossbow builds: ursacomputing/crossbow @ actions-1844

Task	Status
test-fedora-35-cpp
test-r-ubuntu-22.04
test-ubuntu-20.04-cpp-14
test-ubuntu-20.04-cpp-17
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-thread-sanitizer

westonpace · 2022-04-13T20:15:15Z

I'm pretty sure the TSAN failure is unrelated. Just to be sure I ran this via TSAN locally and it passes fine. I'm going to proceed with merge.

ursabot · 2022-04-14T17:11:25Z

Benchmark runs are scheduled for baseline = 6240eae and contender = 681ede6. 681ede6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.38% ⬆️0.21%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.34% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/503| 681ede6f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/489| 681ede6f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/490| 681ede6f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/499| 681ede6f ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/502| 6240eae2 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/488| 6240eae2 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/489| 6240eae2 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/498| 6240eae2 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

westonpace requested review from lidavidm and pitrou February 12, 2022 03:11

github-actions bot added the Component: C++ label Feb 12, 2022

cyb70289 reviewed Feb 12, 2022

View reviewed changes

cpp/src/arrow/util/tracing_internal.cc Outdated Show resolved Hide resolved

cyb70289 reviewed Feb 12, 2022

View reviewed changes

docker-compose.yml Outdated Show resolved Hide resolved

pitrou reviewed Feb 12, 2022

View reviewed changes

lidavidm reviewed Feb 12, 2022

View reviewed changes

westonpace mentioned this pull request Feb 23, 2022

ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing #12491

Closed

westonpace force-pushed the bugfix/ARROW-15604--sporadic-test-failure-with-ot branch from 3a02f28 to effc3b3 Compare February 23, 2022 19:11

lidavidm approved these changes Feb 24, 2022

View reviewed changes

pitrou approved these changes Feb 24, 2022

View reviewed changes

lidavidm mentioned this pull request Feb 25, 2022

ARROW-15789: [C++] Update OpenTelemetry to v1.2.0 #12516

Closed

westonpace added 8 commits April 12, 2022 14:38

ARROW-15604: Threads may end after the OT storage has been destroyed.…

43e5d47

… This results in use-after-free. This commit binds the OT storage lifetime to our thread pool threads so that the storage will not be destroyed until all threads have ended.

Revert "MINOR: [CI][Python] Disable OpenTelemetry in the ubuntu-pytho…

7433918

…n-sdist-test build" This reverts commit d59dbbc.

ARROW-15604: Lint

d75f1b6

ARROW-15604: Addressing PR comments

3e67623

ARROW-15604: Added some explanation to the test

3d668e7

ARROW-15604: Removed unintentional docker-compose change

76ba74a

Revert "ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with …

9c17401

…OpenTracing" This reverts commit 6aa3070.

ARROW-15604: Addressing PR feedback. Upgrade OT to 1.3.0. Switch over…

4e6a7e7

… to grabbing OT's static context instead of resetting it with our own.

westonpace force-pushed the bugfix/ARROW-15604--sporadic-test-failure-with-ot branch from effc3b3 to 4e6a7e7 Compare April 13, 2022 01:11

ARROW-15604: OT 1.3 changed the naming of the curl lib

f9c069d

lidavidm approved these changes Apr 13, 2022

View reviewed changes

westonpace closed this in 681ede6 Apr 13, 2022

	/// \brief Keeps a resource alive until all executor threads have terminated
	/// \brief Keep a resource alive until all executor threads have terminated

ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing #12408

ARROW-15604: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing #12408

Uh oh!

Conversation

westonpace commented Feb 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 12, 2022

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou Feb 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

westonpace commented Feb 23, 2022

Uh oh!

lidavidm commented Feb 24, 2022

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou commented Mar 22, 2022

Uh oh!

lidavidm commented Apr 13, 2022

Uh oh!

lidavidm commented Apr 13, 2022

Uh oh!

github-actions bot commented Apr 13, 2022

Uh oh!

westonpace commented Apr 13, 2022

Uh oh!

ursabot commented Apr 14, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

westonpace commented Feb 12, 2022 •

edited

Loading

pitrou Feb 12, 2022 •

edited

Loading