Conversation

@wesm commented Feb 6, 2017

This turned into a bit of a refactoring bloodbath. I have sorted through most of the issues that this turned up, so I should have this all completely working within a day or so. There will be some follow-up work to polish things up.

Closes #56.

wesm added 10 commits February 5, 2017 16:40
Change-Id: I8d6845bbd58aca80fc63b3d7171558e36ce130c6
… Implement/test bitmap set bit count with offset

Change-Id: Icfb98992e0c060422362ff99e44bbb1bbe40be7b
Change-Id: Ie40f415d18f3f24f35cd1a49e2138ed9cd734a35
Change-Id: I73120a386114b91063be6d4cf2f30423798eceb8
…r clarity. Test Slice for primitive arrays

Change-Id: I11fce17b5581838ae907a69ea3d09094bddf9c96
Change-Id: I5128f1a3372c5f3c219a9f2eaaee438418962a82
Change-Id: Id13727cefbacadb89dee6e178175634f84d00cb0
Change-Id: I393711125073fec11860c47bf3496805be6618ba
…nter and comparison fixed for sliced bitmaps, etc. Not all working yet

Change-Id: I74c0e1548b420becac2d4d329e752d5719d51912
Change-Id: I1096b38d9c78f50b7d92e2abe7af237d80dae5dd
@wesm commented Feb 6, 2017

OK, I almost have this done. The dense union case is pretty annoying, so that's the last IPC case to handle. From here I need to take care of:

  • Fix last IPC test case (dense union)
  • Testing ArrayApproxEquals on floating point arrays after slicing

This doesn't have test coverage for offsets in the JSON reader/writer, but since that is only used for integration testing I don't think it needs to get done in this patch.

Change-Id: I23040e716d40f6d19642165f21a25f741ce8879f
@xhochy left a comment

Minor comments, but looks fine in general. I'm a bit confused about the virtual pointers. Since we quite often have std::shared_ptr<Array> instances, I would expect that weird things could happen if the child classes don't implement virtual destructors.

/// Base class for fixed-size logical types
class ARROW_EXPORT PrimitiveArray : public Array {
public:
virtual ~PrimitiveArray() {}
@xhochy:

Accidental delete of the destructor?

@wesm:

I actually spent a good bit of time googling about this.

My understanding is that if the base class has a virtual destructor, then it is not necessary to provide trivial destructor implementations in subclasses. If you forget to declare a virtual dtor in the base class, then deleting a derived instance through a base-class pointer has undefined behavior.

If this is not true, we should really find a concrete reference. The reason I looked into it was that I see a number of duplicate symbols in libarrow.so when running nm -g. I'm not really sure why that is; at some point we should figure it out.

@xhochy:

For a long time I have taken clang-static-analyzer as the authority on this: if it doesn't complain about a missing virtual destructor, everything is OK.

As long as a virtual destructor is still generated for these classes, I'm fine with this. I wasn't aware of that behaviour yet.

explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type)
: ArrayBuilder(pool, type), data_(nullptr) {}

virtual ~PrimitiveBuilder() {}
@xhochy:

It seems like there is some intention here to delete the virtual destructors? What's the reasoning for this?

if (!left.data() && !(right.data())) { return true; }
return left.data()->Equals(*right.data(), left.raw_value_offsets()[left.length()]);
} else {
// Compare the corresponding data range
@xhochy:

We can only memcmp the whole range for arrays with null_count == 0.

@wesm:

If this is a "bug", it was present before this patch =) I agree that this is incorrect if there is a null value in a slot with non-zero length according to the offsets and data (e.g. a null bitmap "overlay" on some existing data). Let me add a test case for this.
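To make the failure mode concrete, here is a minimal null-aware comparison sketch (a hypothetical helper, not the Arrow API): two value buffers differ only in the bytes covered by a null slot, so a whole-range memcmp reports inequality even though the arrays are logically equal.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: compare two variable-length value buffers slot by
// slot, treating the bytes under null slots as "don't care". A plain
// memcmp over the whole value buffer is only valid when null_count == 0,
// because a null slot may still cover a nonzero-length byte range (e.g. a
// null bitmap overlaid on existing data).
bool NullAwareEquals(const uint8_t* left, const uint8_t* right,
                     const std::vector<int32_t>& offsets,
                     const std::vector<bool>& valid) {
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    if (!valid[i]) continue;  // bytes under a null slot are undefined
    const int32_t len = offsets[i + 1] - offsets[i];
    if (std::memcmp(left + offsets[i], right + offsets[i], len) != 0) {
      return false;
    }
  }
  return true;
}
```

For example, buffers `"fooXbar"` and `"fooYbar"` with offsets `{0, 3, 4, 7}` and validity `{true, false, true}` compare equal here, while `memcmp` over all 7 bytes does not.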


// The number of bits until fast_count_start
const int64_t initial_bits = std::min(length, fast_count_start - bit_offset);
for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) {
@xhochy:

Using the macros defined here https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/bit-util.h#L35 should be a bit faster (or rather more understandable for the compiler).

@wesm:

Since this code segment touches fewer than 64 bits, I'll leave this optimization for later (to save me thinking through the indexing math =) )


// popcount as much as possible with the widest possible count
for (auto iter = u64_data; iter < end; ++iter) {
count += __builtin_popcountll(*iter);
@xhochy:

As this is an SSE4.2 instruction, and I expect >90% of our (Python) users to have binaries compiled without -msse4.2 but CPUs that do support SSE4.2, we might need to think about runtime detection: a portable `CountSetBits` and a faster `CountSetBits_sse42`. (Note that manylinux1 and conda packages have to stick to SSE2-only support to be portable.)

@wesm:

Agreed, I'm opening a JIRA about it now.
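A hedged sketch of what such runtime detection could look like using GCC/Clang builtins (hypothetical function names; not what was ultimately committed): a portable SWAR popcount as the fallback, with a one-time dispatch to the hardware instruction where the CPU supports it.

```cpp
#include <cstdint>

// Portable fallback: SWAR popcount, no hardware POPCNT required.
static int64_t PopcountPortable(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return static_cast<int64_t>((x * 0x0101010101010101ULL) >> 56);
}

// Compiles to the POPCNT instruction when the compiler is allowed to use it.
static int64_t PopcountHardware(uint64_t x) {
  return __builtin_popcountll(x);
}

using PopcountFn = int64_t (*)(uint64_t);

// One-time runtime dispatch: binaries built without -msse4.2 still run
// everywhere, but use the hardware path where the CPU supports it.
static PopcountFn SelectPopcount() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if (__builtin_cpu_supports("popcnt")) return PopcountHardware;
#endif
  return PopcountPortable;
}

static const PopcountFn kPopcount = SelectPopcount();
```

The function pointer is resolved once at startup, so the per-word cost in the hot loop is a single indirect call (or none, if the dispatch is hoisted out of the loop).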

wesm added 4 commits February 6, 2017 09:27
Change-Id: Iead66e3b86a52fa9b1188e4ceb477e5cdfecd744
Change-Id: Ife519a479657883e2c93a96cad587725bfa81179
Change-Id: Ifa914f92611e00a9e3514b65680c8d2254b63a29
Change-Id: Id76bae5adf3263f0b01ef015c4a6857779262df0
@wesm changed the title from "WIP ARROW-33: [C++] Implement zero-copy array slicing, integrate with IPC code paths" to "ARROW-33: [C++] Implement zero-copy array slicing, integrate with IPC code paths" Feb 6, 2017
@wesm commented Feb 6, 2017

@xhochy OK, I'm done. The build won't pass because of the API changes, which break parquet-cpp. About to put up a patch there.

@xhochy left a comment

+1, I already looked through the commits along the way. I guess we have to live with a broken build for a day or so, then.

@wesm commented Feb 6, 2017

@xhochy I fixed the Parquet build at apache/parquet-cpp#236 -- the Int96 conversion is failing; it's unclear at a glance how it could be related to this patch.

@wesm commented Feb 6, 2017

It's possible that it was always broken and that the refactoring exposed a bug that was not caught before.

@wesm commented Feb 6, 2017

OK, merging so we can update the parquet-cpp version and work on the bug there.

@asfgit asfgit closed this in 5439b71 Feb 6, 2017
@wesm wesm deleted the ARROW-33 branch February 6, 2017 16:29
@xhochy commented Feb 6, 2017

I have one failing test case in debug mode:

[ RUN      ] RoundTripTests/TestWriteRecordBatch.SliceRoundTrip/5
.../arrow/cpp/src/arrow/buffer.cc:60 Check failed: (offset) < (buffer->size())
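One plausible way a strict `offset < size` check fires during slice round-trips is a zero-length slice that starts exactly at the end of a buffer. A minimal sketch of the relaxed precondition, with a hypothetical `Buffer`/`SliceBuffer` rather than the actual Arrow code:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Minimal stand-in for a buffer type, just enough to show the check.
struct Buffer {
  const uint8_t* data;
  int64_t size;
};

// Hypothetical SliceBuffer: allowing offset == size (<=, not the strict <
// of the failing assertion above) permits a valid zero-length slice at
// the very end of the parent buffer.
std::shared_ptr<Buffer> SliceBuffer(const std::shared_ptr<Buffer>& buffer,
                                    int64_t offset, int64_t length) {
  assert(offset >= 0 && offset <= buffer->size);
  assert(length >= 0 && offset + length <= buffer->size);
  return std::make_shared<Buffer>(Buffer{buffer->data + offset, length});
}
```

Whether that is the actual cause of the failing `SliceRoundTrip/5` case would need to be confirmed against the test.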

asfgit pushed a commit to apache/parquet-cpp that referenced this pull request Feb 6, 2017
See ARROW-33 patch apache/arrow#322

@xhochy this fails on Int96 timestamps. I'm not sure why yet

Author: Wes McKinney <[email protected]>

Closes #236 from wesm/PARQUET-866 and squashes the following commits:

4966fcb [Wes McKinney] Fix off-by-one error in int96 test case
5976d59 [Wes McKinney] Update Arrow version to head with ARROW-33
b1b69b9 [Wes McKinney] clang-format
dfb2e2e [Wes McKinney] API fixes for ARROW-33 patch
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 8, 2018
Fix debug asserts in tests (msvc/debug build)

Author: revaliu <[email protected]>

Closes apache#322 from rip-nsk/PARQUET-679 and squashes the following commits:

33fc780 [revaliu] PARQUET-679: refactor too long line
057a84a [revaliu] PARQUET-679: fix "vector subscript out of range" debug assert in reader and scanner tests
d50dea3 [revaliu] PARQUET-679: fix "vector iterator + offset out of range" debug assert in memory-test

Change-Id: Idc8f9647b88630e07fcba37e58a3c6f4a66b6b17
