-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-3979 : [Gandiva] fix all valgrind reported errors #3201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/gandiva/projector.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really being left uninitialized?
FWIW we zero the memory in arrow::BooleanBuilder because valgrind doesn't like in-place modifications of uninitialized bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gandiva doesn't use the builders - it allocates the buffers directly in cpp code (for the batch), and update the buffers in IR code. using builders is tricky since they expect the updates also to happen through the builder APIs (eg. for tracking length).
Is it really being left uninitialized?
gandiva only updates the relevant bits. eg. for a projector with expression "a < b" having a batch of 6 elements, gandiva will update 6 bits in the output boolean vector (to either 0 or 1 depending on the values of a and b). The remaining 2 bits are left uninitialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't suggesting that you use the builders, just noting that we've also experienced valgrind issues with boolean arrays
Uninitialized bits are not an issue. I was curious why a whole byte is uninitialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still curious about the answer to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what it means is that avoid Valgrind errors with uninitialized bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, valgrind does require us to zero out the entire bitmap. @shyambits2004, can you please check if we do have a unit test that projects more than 8 elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou agreed, we zero out all of our bytes in BooleanBuilder for example. I'm trying to understand why the last byte. That is what seems weird to me -- per @pravindra it may be that the testing is not comprehensive enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, but Valgrind is able to detect individual uninitialized bits. So only the trailing bits in the last byte would be a problem.
It also depends which exact operation is used for setting or clearing the bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware that valgrind had bit-level precision (http://valgrind.org/docs/memcheck2005.pdf) so this is probably it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm, you are right. valgrind complained when I modified the test to have 12 output elements. gandiva uses arrow::util::SetBitTo() to update bitmaps, and there's a comment in the function that it confuses valgrind.
I've opened ARROW-4115 for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<uint8_t> bitmap(bitmap_capacity) might be a bit more idiomatic
|
@shyambits2004 there still seem to be some valgrind failures in the CI. Maybe, the CI is using different flags ? |
|
I used "cmake .. -DARROW_GANDIVA=ON -DARROW_VALGRIND=on -DARROW_GANDIVA_BUILD_TESTS=ON -DARROW_TEST_MEMCHECK=ON" to fix all the valgrind errors. But run on travis seems to be failing with new error. I am trying to check the travis scripts to see if I missed a flag. |
|
The errors are legitimate and point to the same pattern: "Mismatched free() / delete / delete []". As @wesm said, it would be more idiomatic and less error-prone to use |
cpp/valgrind.supp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Before we add suppressions for these, perhaps it would be possible to instead reclaim the memory automatically? (for example by using unique_ptr or shared_ptr instead of raw pointers).
|
For example, in std::unique_ptr<uint8_t*> buffers_array_;But also: buffers_array_.reset(new uint8_t*[num_buffers]);This means the declaration should really be: std::unique_ptr<uint8_t*[]> buffers_array_;This mistake can mostly go silently, but not always. |
|
I was more interested in reproing the issue in my env and then fixing it. Did not want to use CI as testing. Did i miss a flag, not that I can see till now. Please let me know |
|
@shyambits2004 I don't know. This might simply be a different compiler or libstdc++ version. |
|
The CI environment is Ubuntu 14.04 / gcc 4.9 I think |
|
The reason I am more interested in reproducing the issue is I do not want ci to test bed. My environment : gcc (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4 |
|
Tried all flags got from the travis job. But, unable to reproduce the problem. cmake .. -DARROW_TEST_INCLUDE_LABELS=gandiva -DARROW_NO_DEPRECATED_API=ON -DARROW_EXTRA_ERROR_CONTEXT=ON -DARROW_JEMALLOC=ON -DARROW_GANDIVA=ON -DARROW_GANDIVA_JAVA=ON -DARROW_BUILD_TESTS=ON -DARROW_TEST_MEMCHECK=ON -DCMAKE_BUILD_TYPE=debug -DBUILD_WARNING_LEVEL=CHECKIN -DARROW_GANDIVA_BUILD_TESTS=ON Any help on local repro is appreciated. |
|
Here are the valgrind errors I get on Ubuntu 14.04 building with clang-6 https://gist.github.com/wesm/b83a83b954a4bc77ce245f4f29ceff7f Here are my local options: |
|
Fell sick yesterday. Will make sure this is done today. I was unable to repro it on my local setup even with @wesm options. So, will use CI as testbed in this circumstances. If it the turnaround is taking time, we shall collaborate tonight IST/tomorrow morning PST to complete it. Hopefully, will be done today. |
a5b0b62 to
a7f5e9a
Compare
|
So when you run the following you get no errors? I'm on the same Linux distribution as you so it is disturbing to me that the results are different: |
|
$ valgrind --tool=memcheck --suppressions=../valgrind.supp debug/gandiva-projector_test [----------] Global test environment tear-down |
|
The only difference I found is suppressions are more. So, took out the suppressions file in hope that Mismatched delete was suppressed, but did not hit the Mismatched error (other errors were reported). |
|
This is actually frustrating. Just one more thing, What is the kernel version ? $ uname -a |
a7f5e9a to
f875902
Compare
|
Nearly the same kernel |
|
I'm building locally with clang-6.0. In CI it's gcc 4.9 though |
|
Ok. I am pretty sure, probably a very minute difference but hard to uncover at this point. Looks like CI does not want to act like test bed too. ~/build/apache/arrow/cpp-build ~/build/apache/arrow Did something change ? |
3ebc184 to
ef380da
Compare
c772d48 to
a66c288
Compare
pravindra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
a66c288 to
dc01aff
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good to me, just a few comments.
.travis.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing --only-library probably makes building slower. Is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will go away with ARROW-3803
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this option, gandiva tests were not running as part of CI. I guess in the travis gandiva script, gandiva tests option is enabled only if "--only-library" was not enabled. But, if the 3803 takes care of it, I am fine removing it.
cpp/cmake_modules/BuildUtils.cmake
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the Valgrind documentation:
Performance overhead: origin tracking is expensive. It halves Memcheck's speed and increases memory use by a minimum of 100MB, and possibly more. Nevertheless it can drastically reduce the effort required to identify the root cause of uninitialised value errors, and so is often a programmer productivity win, despite running more slowly.
I'm not sure we want to make CI builds slower than they need to be. Is this necessary in your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say only enable this in local, if this doesn't increase the coverage of finding new bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'll remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fine. It can be a local env addition, if valgrind reports some errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it "gandiva-tests" rather than "gandiva"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the result of 9fcce64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds bizarre to have the name "tests" in test labels. Is it because Gandiva has microbenchmakrs in its tests as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we removed the microbenchmarks from the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look more closely at 9fcce64
We now have targets gandiva (libraries) gandiva-tests (tests) and gandiva-benchmarks (benchmarks). The label matches the target name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. I guess it doesn't make a difference when running e.g. ctest -L arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to run ctest -L arrow-tests now but yeah https://github.com/apache/arrow/blob/master/ci/travis_script_gandiva_cpp.sh#L26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bitmap.data() is more idiomatic, though both are ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data should be zero initialized in the vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which also implies that all memset are useless with this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed them
|
I will tweak this and then merge so I can get a passing build in ARROW-3803 and then merge that. @pitrou could you have a look at that, since the only thing that needs to change there is to re-enable valgrind |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Waiting for CI and then will merge and rebase ARROW-3803 with valgrind re-enabled so we can hopefully get that merged soon today
Codecov Report
@@ Coverage Diff @@
## master #3201 +/- ##
==========================================
+ Coverage 86.42% 86.42% +<.01%
==========================================
Files 508 508
Lines 70077 70080 +3
==========================================
+ Hits 60566 60570 +4
+ Misses 9404 9403 -1
Partials 107 107
Continue to review full report at Codecov.
|
|
\o/ |
Fix all the issues reported by valgrind and also enable option ARROW_TRAVIS_VALGRIND.