ARROW-33: [C++] Implement zero-copy array slicing, integrate with IPC code paths #322
Conversation
… Implement/test bitmap set bit count with offset
…r clarity. Test Slice for primitive arrays
…nter and comparison fixed for sliced bitmaps, etc. Not all working yet
OK, I almost have this done. The dense union case is pretty annoying, so that's the last IPC case to take care of. From here I need to take care of:
This doesn't have test coverage for offsets in the JSON reader/writer, but since this is only used for integration testing I don't think it needs to get done in this patch.
xhochy
left a comment
Minor comments, but looks fine in general. I'm a bit confused about the virtual destructors: as we quite often have std::shared_ptr<Array> instances, I would expect that weird things could happen if the child classes don't implement virtual destructors.
/// Base class for fixed-size logical types
class ARROW_EXPORT PrimitiveArray : public Array {
 public:
  virtual ~PrimitiveArray() {}
Accidental delete of the destructor?
I actually spent a good bit of time googling about this.
My understanding is that if the base class has a virtual destructor, then it is not necessary to provide trivial destructor implementations in the subclasses. If you forget to declare the destructor virtual in the base class, then deleting a derived instance through a base-class pointer is undefined behavior.
If this is not true, we should really find a concrete reference. The reason I looked into it was that I see a number of duplicate symbols in libarrow.so when running nm -g. I'm not really sure why that is; at some point we should figure it out.
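To make the rule concrete, here is a minimal sketch (the class names are mine for illustration, not Arrow's real classes): once the base class declares a virtual destructor, each derived class gets an implicitly-defined destructor that is also virtual, so the empty `virtual ~Derived() {}` bodies are redundant. The undefined behavior arises in the other direction, when the base destructor is not virtual and a derived object is deleted through a base pointer.

```cpp
// Illustrative sketch of the virtual-destructor discussion above.
struct Base {
  // Drop `virtual` here and the `delete p` in DestroyViaBase() becomes
  // undefined behavior (the derived destructor would not run).
  virtual ~Base() = default;
};

struct Derived : Base {
  bool* destroyed_flag;
  explicit Derived(bool* flag) : destroyed_flag(flag) {}
  // This explicit destructor exists only so we can observe it running;
  // an implicitly-defined one would be virtual too.
  ~Derived() override { *destroyed_flag = true; }
};

// Returns true if ~Derived() ran when deleting through a Base pointer.
bool DestroyViaBase() {
  bool destroyed = false;
  Base* p = new Derived(&destroyed);
  delete p;  // dispatches to ~Derived() because ~Base() is virtual
  return destroyed;
}
```

Note that std::shared_ptr is actually more forgiving than raw delete: a shared_ptr constructed from a Derived* captures the correct deleter at construction time, so the raw-pointer case is the one that truly depends on the virtual destructor.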
For a long time I have taken clang-static-analyzer as the authority on this: if it doesn't complain about a missing virtual destructor, everything is OK.
As long as there is still a virtual destructor created for these classes, I'm fine with this. I wasn't aware of that behaviour yet.
explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type)
    : ArrayBuilder(pool, type), data_(nullptr) {}

virtual ~PrimitiveBuilder() {}
Seems like there is some intention here for deleting the virtual destructors? What's the reasoning for this?
if (!left.data() && !(right.data())) { return true; }
return left.data()->Equals(*right.data(), left.raw_value_offsets()[left.length()]);
} else {
  // Compare the corresponding data range
We can only memcmp the whole range for arrays with null_count = 0.
If this is a "bug", it was present before this =) I agree that this is incorrect if there is a null value in a slot with non-zero length according to the offsets and data (e.g. a null bitmap "overlay" of some existing data). Let me add a test case for this.
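A minimal sketch of the failure mode being discussed (the function and layout here are illustrative, not Arrow's actual comparison code): a list slot i covers the child bytes values[offsets[i], offsets[i+1]); when the slot is null, those underlying bytes are unspecified, so two logically equal arrays need not share them, and a memcmp over the whole range can report a spurious difference. That is why the whole-range memcmp is only safe when null_count == 0.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical slot-by-slot comparison that skips the value ranges under
// null slots, unlike a single memcmp over the full child buffer.
bool ListRangesEqual(const std::vector<int32_t>& offsets,
                     const std::vector<bool>& valid,
                     const std::vector<uint8_t>& left_values,
                     const std::vector<uint8_t>& right_values) {
  const size_t num_slots = offsets.size() - 1;
  for (size_t i = 0; i < num_slots; ++i) {
    if (!valid[i]) continue;  // null slot: underlying bytes don't matter
    const int32_t begin = offsets[i];
    const int32_t len = offsets[i + 1] - begin;
    if (len > 0 &&
        std::memcmp(left_values.data() + begin, right_values.data() + begin,
                    static_cast<size_t>(len)) != 0) {
      return false;
    }
  }
  return true;
}
```

For example, with offsets {0, 2, 4, 6} and slot 1 null, two arrays that differ only in bytes [2, 4) compare equal slot-by-slot while memcmp over all six bytes reports a mismatch.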
// The number of bits until fast_count_start
const int64_t initial_bits = std::min(length, fast_count_start - bit_offset);
for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) {
Using the macros defined here https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/bit-util.h#L35 should be a bit faster (or rather more understandable for the compiler).
Since this code segment touches fewer than 64 bits, I'll leave this optimization for later (to save me thinking through the indexing math =) )
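For readers following along, the overall shape being discussed is roughly this (a self-contained sketch with illustrative names, not the patch's exact code): count the unaligned "initial bits" one at a time up to the next 64-bit boundary, popcount whole words through the middle, then count the tail bits individually.

```cpp
#include <algorithm>
#include <cstdint>

static bool BitIsSet(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Count set bits in bits[bit_offset, bit_offset + length), offset-aware.
int64_t CountSetBits(const uint8_t* bits, int64_t bit_offset, int64_t length) {
  int64_t count = 0;
  // First 64-bit boundary at or after bit_offset.
  const int64_t fast_count_start = ((bit_offset + 63) / 64) * 64;
  // Unaligned leading bits, counted one by one.
  const int64_t initial_bits = std::min(length, fast_count_start - bit_offset);
  for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) {
    if (BitIsSet(bits, i)) ++count;
  }
  // Popcount as many whole 64-bit words as fit in the remainder.
  const int64_t remaining = length - initial_bits;
  const int64_t num_words = remaining / 64;
  if (num_words > 0) {
    const uint64_t* words =
        reinterpret_cast<const uint64_t*>(bits + fast_count_start / 8);
    for (int64_t w = 0; w < num_words; ++w) {
      count += __builtin_popcountll(words[w]);
    }
  }
  // Unaligned trailing bits after the last whole word.
  for (int64_t i = fast_count_start + num_words * 64; i < bit_offset + length;
       ++i) {
    if (BitIsSet(bits, i)) ++count;
  }
  return count;
}
```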
// popcount as much as possible with the widest possible count
for (auto iter = u64_data; iter < end; ++iter) {
  count += __builtin_popcountll(*iter);
As this is an SSE4.2 instruction, and I expect >90% of our (Python) users to have binaries compiled without -msse4.2 but CPUs that do support SSE4.2, we might need to think about runtime detection: a portable CountSetBits and a faster `CountSetBits_sse42`. (Note that manylinux1 and conda packages have to stick to SSE2-only support to be portable.)
Agreed, I'm opening a JIRA about it now.
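The runtime-dispatch idea could look roughly like this (a sketch under assumptions: function names are mine, and __builtin_cpu_supports is a GCC/Clang x86 builtin; the POPCNT CPU flag ships alongside SSE4.2 on the CPUs in question). Ship a portable SWAR bit count plus a hardware-popcount variant, and select one at startup:

```cpp
#include <cstdint>

// Portable parallel bit count; needs no POPCNT instruction.
int64_t CountSetBitsPortable(const uint64_t* words, int64_t n) {
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    uint64_t v = words[i];
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    count += static_cast<int64_t>((v * 0x0101010101010101ULL) >> 56);
  }
  return count;
}

// Variant that compiles to the hardware POPCNT instruction when the
// translation unit is built with the appropriate -m flags.
int64_t CountSetBitsPopcnt(const uint64_t* words, int64_t n) {
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) count += __builtin_popcountll(words[i]);
  return count;
}

using CountFn = int64_t (*)(const uint64_t*, int64_t);

// Pick the implementation once, based on what the running CPU supports.
CountFn SelectCountSetBits() {
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("popcnt")) return CountSetBitsPopcnt;
#endif
  return CountSetBitsPortable;
}
```

The selected function pointer would typically be cached in a static so the CPU check runs only once per process.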
@xhochy OK, I'm done. The build won't pass because of the API changes, which break parquet-cpp. About to put up a patch there.
xhochy
left a comment
+1, looked already through the commits on the way. I guess we have to live with a broken build for a day or so then.
@xhochy I fixed the Parquet build at apache/parquet-cpp#236 -- the Int96 conversion is failing; it's unclear at a glance how it could be related to this patch.

It's possible that it was always broken and that the refactoring exposed a bug that was not caught before.

Here is the bug I fixed: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc#L342

OK, merging so we can update the parquet-cpp version and work on the bug there.

I have one failing test case in debug mode:
See ARROW-33 patch apache/arrow#322. @xhochy this fails on Int96 timestamps. I'm not sure why yet.

Author: Wes McKinney <[email protected]>

Closes #236 from wesm/PARQUET-866 and squashes the following commits:

4966fcb [Wes McKinney] Fix off-by-one error in int96 test case
5976d59 [Wes McKinney] Update Arrow version to head with ARROW-33
b1b69b9 [Wes McKinney] clang-format
dfb2e2e [Wes McKinney] API fixes for ARROW-33 patch
This turned into a bit of a refactoring bloodbath. I have sorted through most of the issues that this turned up, so I should have this all completely working within a day or so. There will be some follow-up work to polish things up.
Closes #56.