Skip to content

Conversation

@westonpace
Copy link
Member

@westonpace westonpace commented Feb 12, 2022

The particular cause of the failure was:

  void InputReceived(ExecNode* input, ExecBatch batch) override {
    EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
    util::tracing::Span span;
    START_SPAN_WITH_PARENT(span, span_, "InputReceived",
                           {{"node.label", label()}, {"batch.length", batch.length}});

    DCHECK_EQ(input, inputs_[0]);

    bool did_push = producer_.Push(std::move(batch));
    if (!did_push) return;  // producer_ was Closed already

    if (input_counter_.Increment()) {
      // This will mark the exec plan finished
      Finish();
      // At this point the main thread is generally free to exit but span hasn't been destructed yet
    }
  }

It could be fixed with scoping:

  void InputReceived(ExecNode* input, ExecBatch batch) override {
    EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
    {
      util::tracing::Span span;
      START_SPAN_WITH_PARENT(span, span_, "InputReceived",
                             {{"node.label", label()}, {"batch.length", batch.length}});

      DCHECK_EQ(input, inputs_[0]);

      bool did_push = producer_.Push(std::move(batch));
      if (!did_push) return;  // producer_ was Closed already
    }
    if (input_counter_.Increment()) {
      // This will mark the exec plan finished
      Finish();
    }
  }

However, I suspect this would quickly get tedious. Instead it seems we can control the lifetime of the OT runtime storage and bind it to our worker threads. In the future we may want to consider doing similar tricks to keep alive the memory pool, global thread pools, etc.

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to do this in a library? Applications or libraries using Arrow C++ may want to use their own context storage and will be surprised that Arrow overrides it. @lidavidm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this should generally be left to the application. For us, I suppose that's technically each individual test; that may get tedious, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should file an upstream issue? If we could access the upstream shared_ptr, we could keep references to it ourselves and be independent of what the application wants to do: https://github.com/open-telemetry/opentelemetry-cpp/blob/3a3bf25289901079534b1cabe14e9c4fb3b35968/api/include/opentelemetry/context/runtime_context.h#L154

A brief reproduction using a thread and TSAN might be enough to convince them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good points. I'll file an upstream issue and put this on hold for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for filing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the implemented it indeed. Maybe we can bump the bundled OpenTelemetry, or wait for a new release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they are releasing monthly. Let's just disable OT in our CI builds until the release is out and we can address this at that point.

Copy link
Member

@pitrou pitrou Feb 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably deserve a more general approach. For example:

class Executor {
 public:
  class Resource {
   public:
    virtual ~Resource();
  };
  // All live executors should keep this object alive
  static void KeepAlive(std::shared_ptr<Resource>);
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to what I think you were intending. This basically inverts the dependency so that OT has to call keepalive on the thread pools instead of the other way around. The implementation's KeepAlive method was not static though. Can you double check that I addressed your concerns here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this should generally be left to the application. For us, I suppose that's technically each individual test; that may get tedious, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should file an upstream issue? If we could access the upstream shared_ptr, we could keep references to it ourselves and be independent of what the application wants to do: https://github.com/open-telemetry/opentelemetry-cpp/blob/3a3bf25289901079534b1cabe14e9c4fb3b35968/api/include/opentelemetry/context/runtime_context.h#L154

A brief reproduction using a thread and TSAN might be enough to convince them.

lidavidm pushed a commit that referenced this pull request Feb 23, 2022
This is not a fix for the issue.  The proper fix is at #12408 but cannot be merged as we are waiting for an OT release with an upstream fix.  This is just to disable TSAN testing of OT in the meantime so we don't hide other bugs that may crop up while we wait for the proper fix.  We should revert this change as part of #12408

Closes #12491 from westonpace/bugfix/ARROW-15604--temporarily-disable-ot-in-ci

Authored-by: Weston Pace <[email protected]>
Signed-off-by: David Li <[email protected]>
@westonpace
Copy link
Member Author

I've made an attempt at addressing the PR feedback in preparation for the OT release. This should be pretty close to what we will use (I think only two lines will change once we can use their new API).

The unit test is a little unreliable. If I run it on repeat I can trigger the failure much more reliably than running it a single time. I played around with a few different approaches but couldn't come up with a variation that failed very reliably. I'm not sure if we want to leave it in or just get rid of it and rely on catching this via TSAN were there to be a regression.

@westonpace westonpace force-pushed the bugfix/ARROW-15604--sporadic-test-failure-with-ot branch from 3a02f28 to effc3b3 Compare February 23, 2022 19:11
@lidavidm
Copy link
Member

We can keep the test, I think. Even if we plan to catch regressions via TSAN, presumably this will trigger it more reliably?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good enough to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, sorry :-)

Suggested change
/// \brief Keeps a resource alive until all executor threads have terminated
/// \brief Keep a resource alive until all executor threads have terminated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... does this need ARROW_EXPORT?

lidavidm added a commit that referenced this pull request Feb 28, 2022
Quickly bump the version since it changes a few APIs we'll use (most notably for #11920).

#11963 will also need updating, but the conda-forge packages need to be updated first.

This does not include the fix needed for #12408, that will require another version bump.

Closes #12516 from lidavidm/arrow-15789

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
@pitrou
Copy link
Member

pitrou commented Mar 22, 2022

What is the status of this?

@lidavidm
Copy link
Member

OpenTelemetry v1.3.0 just released with the necessary fix, so we should be able to update it and get this properly fixed now.

… This results in use-after-free. This commit binds the OT storage lifetime to our thread pool threads so that the storage will not be destroyed until all threads have ended.
… to grabbing OT's static context instead of resetting it with our own.
@westonpace westonpace force-pushed the bugfix/ARROW-15604--sporadic-test-failure-with-ot branch from effc3b3 to 4e6a7e7 Compare April 13, 2022 01:11
@lidavidm
Copy link
Member

@github-actions crossbow submit test-fedora-35-cpp test-ubuntu-20.04-cpp-* test-r-ubuntu-22.04

@github-actions
Copy link

Revision: f9c069d

Submitted crossbow builds: ursacomputing/crossbow @ actions-1844

Task Status
test-fedora-35-cpp Github Actions
test-r-ubuntu-22.04 Github Actions
test-ubuntu-20.04-cpp-14 Github Actions
test-ubuntu-20.04-cpp-17 Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions

@westonpace
Copy link
Member Author

I'm pretty sure the TSAN failure is unrelated. Just to be sure I ran this via TSAN locally and it passes fine. I'm going to proceed with merge.

@ursabot
Copy link

ursabot commented Apr 14, 2022

Benchmark runs are scheduled for baseline = 6240eae and contender = 681ede6. 681ede6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.38% ⬆️0.21%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.34% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/503| 681ede6f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/489| 681ede6f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/490| 681ede6f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/499| 681ede6f ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/502| 6240eae2 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/488| 6240eae2 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/489| 6240eae2 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/498| 6240eae2 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants