ARROW-33: [C++] Implement zero-copy array slicing, integrate with IPC code paths #322
Conversation
… Implement/test bitmap set bit count with offset
…r clarity. Test Slice for primitive arrays
…nter and comparison fixed for sliced bitmaps, etc. Not all working yet
OK, I almost have this done. The dense union case is pretty annoying, so that's the last IPC case to take care of. From here I need to take care of:
This doesn't have test coverage for offsets in the JSON reader/writer, but since this is only used for integration testing I don't think it needs to get done in this patch.
xhochy
left a comment
Minor comments, but looks fine in general. I'm a bit confused about the virtual destructors: as we quite often have std::shared_ptr<Array> instances, I would expect that weird things could happen if the child classes don't implement virtual destructors.
/// Base class for fixed-size logical types
class ARROW_EXPORT PrimitiveArray : public Array {
 public:
  virtual ~PrimitiveArray() {}
Accidental delete of the destructor?
I actually spent a good bit of time googling about this.
My understanding is that if the base class has a virtual destructor, then it is not necessary to provide trivial destructor implementations in the subclasses. If you forget to declare the destructor virtual in the base class, then deleting a derived instance through a base-class pointer is undefined behavior.
If this is not true, we should really find a concrete reference. The reason I looked into it was that I see a number of duplicate symbols in libarrow.so when running nm -g. I'm not really sure why that is; at some point we should figure it out.
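To make the rule concrete, here is a minimal sketch (the class names are mine for illustration, not Arrow's real classes): once the base class declares a virtual destructor, each derived class gets an implicitly-defined destructor that is also virtual, so the empty `virtual ~Derived() {}` bodies are redundant. The undefined behavior arises in the other direction, when the base destructor is not virtual and a derived object is deleted through a base pointer.

```cpp
// Illustrative sketch of the virtual-destructor discussion above.
struct Base {
  // Drop `virtual` here and the `delete p` in DestroyViaBase() becomes
  // undefined behavior (the derived destructor would not run).
  virtual ~Base() = default;
};

struct Derived : Base {
  bool* destroyed_flag;
  explicit Derived(bool* flag) : destroyed_flag(flag) {}
  // This explicit destructor exists only so we can observe it running;
  // an implicitly-defined one would be virtual too.
  ~Derived() override { *destroyed_flag = true; }
};

// Returns true if ~Derived() ran when deleting through a Base pointer.
bool DestroyViaBase() {
  bool destroyed = false;
  Base* p = new Derived(&destroyed);
  delete p;  // dispatches to ~Derived() because ~Base() is virtual
  return destroyed;
}
```

Note that std::shared_ptr is actually more forgiving than raw delete: a shared_ptr constructed from a Derived* captures the correct deleter at construction time, so the raw-pointer case is the one that truly depends on the virtual destructor.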
For a long time I have taken clang-static-analyzer as the authority on this: if it doesn't complain about a missing virtual destructor, everything is OK.
As long as there is still a virtual destructor created for these classes, I'm fine with this. I wasn't aware of that behaviour yet.
explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type)
    : ArrayBuilder(pool, type), data_(nullptr) {}

virtual ~PrimitiveBuilder() {}
Seems like there is some intention here for deleting the virtual destructors? What's the reasoning for this?
if (!left.data() && !(right.data())) { return true; }
return left.data()->Equals(*right.data(), left.raw_value_offsets()[left.length()]);
} else {
  // Compare the corresponding data range
We can only memcmp the whole range for arrays with null_count = 0.
If this is a "bug", it was present before this =) I agree that this is incorrect if there is a null value in a slot with non-zero length according to the offsets and data (e.g. a null bitmap "overlay" of some existing data). Let me add a test case for this.
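A minimal sketch of the failure mode being discussed (the function and layout here are illustrative, not Arrow's actual comparison code): a list slot i covers the child bytes values[offsets[i], offsets[i+1]); when the slot is null, those underlying bytes are unspecified, so two logically equal arrays need not share them, and a memcmp over the whole range can report a spurious difference. That is why the whole-range memcmp is only safe when null_count == 0.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical slot-by-slot comparison that skips the value ranges under
// null slots, unlike a single memcmp over the full child buffer.
bool ListRangesEqual(const std::vector<int32_t>& offsets,
                     const std::vector<bool>& valid,
                     const std::vector<uint8_t>& left_values,
                     const std::vector<uint8_t>& right_values) {
  const size_t num_slots = offsets.size() - 1;
  for (size_t i = 0; i < num_slots; ++i) {
    if (!valid[i]) continue;  // null slot: underlying bytes don't matter
    const int32_t begin = offsets[i];
    const int32_t len = offsets[i + 1] - begin;
    if (len > 0 &&
        std::memcmp(left_values.data() + begin, right_values.data() + begin,
                    static_cast<size_t>(len)) != 0) {
      return false;
    }
  }
  return true;
}
```

For example, with offsets {0, 2, 4, 6} and slot 1 null, two arrays that differ only in bytes [2, 4) compare equal slot-by-slot while memcmp over all six bytes reports a mismatch.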
// The number of bits until fast_count_start
const int64_t initial_bits = std::min(length, fast_count_start - bit_offset);
for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) {
Using the macros defined here https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/bit-util.h#L35 should be a bit faster (or rather more understandable for the compiler).
Since this code segment touches fewer than 64 bits, I'll leave this optimization for later (to save me thinking through the indexing math =) )
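For readers following along, the overall shape being discussed is roughly this (a self-contained sketch with illustrative names, not the patch's exact code): count the unaligned "initial bits" one at a time up to the next 64-bit boundary, popcount whole words through the middle, then count the tail bits individually.

```cpp
#include <algorithm>
#include <cstdint>

static bool BitIsSet(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Count set bits in bits[bit_offset, bit_offset + length), offset-aware.
int64_t CountSetBits(const uint8_t* bits, int64_t bit_offset, int64_t length) {
  int64_t count = 0;
  // First 64-bit boundary at or after bit_offset.
  const int64_t fast_count_start = ((bit_offset + 63) / 64) * 64;
  // Unaligned leading bits, counted one by one.
  const int64_t initial_bits = std::min(length, fast_count_start - bit_offset);
  for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) {
    if (BitIsSet(bits, i)) ++count;
  }
  // Popcount as many whole 64-bit words as fit in the remainder.
  const int64_t remaining = length - initial_bits;
  const int64_t num_words = remaining / 64;
  if (num_words > 0) {
    const uint64_t* words =
        reinterpret_cast<const uint64_t*>(bits + fast_count_start / 8);
    for (int64_t w = 0; w < num_words; ++w) {
      count += __builtin_popcountll(words[w]);
    }
  }
  // Unaligned trailing bits after the last whole word.
  for (int64_t i = fast_count_start + num_words * 64; i < bit_offset + length;
       ++i) {
    if (BitIsSet(bits, i)) ++count;
  }
  return count;
}
```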
// popcount as much as possible with the widest possible count
for (auto iter = u64_data; iter < end; ++iter) {
  count += __builtin_popcountll(*iter);
As this is an SSE4.2 instruction, and I expect >90% of our (Python) users to have binaries compiled without -msse4.2 but CPUs that do support SSE4.2, we might need to think about runtime detection: a portable CountSetBits and a faster `CountSetBits_sse42`. (Note that manylinux1 and conda packages have to stick to SSE2-only support to be portable.)
Agreed, I'm opening a JIRA about it now.
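The runtime-dispatch idea could look roughly like this (a sketch under assumptions: function names are mine, and __builtin_cpu_supports is a GCC/Clang x86 builtin; the POPCNT CPU flag ships alongside SSE4.2 on the CPUs in question). Ship a portable SWAR bit count plus a hardware-popcount variant, and select one at startup:

```cpp
#include <cstdint>

// Portable parallel bit count; needs no POPCNT instruction.
int64_t CountSetBitsPortable(const uint64_t* words, int64_t n) {
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    uint64_t v = words[i];
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    count += static_cast<int64_t>((v * 0x0101010101010101ULL) >> 56);
  }
  return count;
}

// Variant that compiles to the hardware POPCNT instruction when the
// translation unit is built with the appropriate -m flags.
int64_t CountSetBitsPopcnt(const uint64_t* words, int64_t n) {
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) count += __builtin_popcountll(words[i]);
  return count;
}

using CountFn = int64_t (*)(const uint64_t*, int64_t);

// Pick the implementation once, based on what the running CPU supports.
CountFn SelectCountSetBits() {
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("popcnt")) return CountSetBitsPopcnt;
#endif
  return CountSetBitsPortable;
}
```

The selected function pointer would typically be cached in a static so the CPU check runs only once per process.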
@xhochy OK, I'm done. The build won't pass because of the API changes, which break parquet-cpp. About to put up a patch there.
xhochy
left a comment
+1, looked already through the commits on the way. I guess we have to live with a broken build for a day or so then.
@xhochy I fixed the Parquet build at apache/parquet-cpp#236 -- the Int96 conversion is failing; it's unclear at a glance how it could be related to this patch.

It's possible that it was always broken and that the refactoring exposed a bug that was not caught before.

Here is the bug I fixed: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc#L342

OK, merging so we can update the parquet-cpp version and work on the bug there.

I have one failing test case in debug mode:
See ARROW-33 patch apache/arrow#322. @xhochy this fails on Int96 timestamps. I'm not sure why yet.

Author: Wes McKinney <[email protected]>

Closes #236 from wesm/PARQUET-866 and squashes the following commits:

4966fcb [Wes McKinney] Fix off-by-one error in int96 test case
5976d59 [Wes McKinney] Update Arrow version to head with ARROW-33
b1b69b9 [Wes McKinney] clang-format
dfb2e2e [Wes McKinney] API fixes for ARROW-33 patch
This turned into a bit of a refactoring bloodbath. I have sorted through most of the issues that this turned up, so I should have this all completely working within a day or so. There will be some follow-up work to polish things up.
Closes #56.