Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
cc7f851
add Validate method to array and implementation for ListArray
emkornfield Apr 13, 2016
01c50be
Add utility methods for managing null bitmap directly to ArrayBuilder
emkornfield Apr 13, 2016
3895d34
Refactor list builder to use ArrayBuilders bitmap methods and a separ…
emkornfield Apr 13, 2016
20f984b
refactor primitive builders to use parent builders bitmap
emkornfield Apr 13, 2016
1374485
augment python unittest to have null element in list
emkornfield Apr 13, 2016
45e41c0
Make BufferBuilder more useable for appending primitives
emkornfield Apr 13, 2016
5f87aef
Refactor ipc-adapter-test to make it paramaterizable. add unit test …
emkornfield Apr 13, 2016
61b0481
small fixes to naming/style for c++ and potential bugs
emkornfield Apr 13, 2016
a2e1e52
native popcount
emkornfield Apr 13, 2016
39c57ed
add potentially useful methods for generative arrays to ipc test-common
emkornfield Apr 13, 2016
aa0602c
add potentially useful pool factories to test utils
emkornfield Apr 13, 2016
8e464b5
Fixes per tidy and lint
emkornfield Apr 13, 2016
53d37bc
filter out ipc-adapter-test from tidy
emkornfield Apr 13, 2016
8ab5315
make clang tidy ignore a little bit less hacky
emkornfield Apr 17, 2016
e71810b
make Resize and Init virtual on builder
emkornfield Apr 17, 2016
2e6c477
add missing RETURN_NOT_OK
emkornfield Apr 17, 2016
3b219a1
Make append is_null parameter is_valid for api consistency
emkornfield Apr 18, 2016
10e6651
add in maximum recursion depth, surfaced possible recursion issue wit…
emkornfield Apr 18, 2016
be04b3e
add unit tests for zero length row batches and non-null batches. fix…
emkornfield Apr 18, 2016
8982723
remaining style cleanup
emkornfield Apr 18, 2016
5e15815
fix make lint
emkornfield Apr 18, 2016
6e57728
make format fixes
emkornfield Apr 18, 2016
7789205
make clang-format-3.7 happy
emkornfield Apr 18, 2016
0af558b
remove a now unnecessary NOLINT, but mostly to trigger another travis…
emkornfield Apr 18, 2016
0c5162d
another format fix
emkornfield Apr 18, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -565,7 +565,7 @@ if (${CLANG_TIDY_FOUND})
`find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
# runs clang-tidy and exits with a non-zero exit code if any errors are found.
add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json
0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc |grep -v -F -f ${CMAKE_CURRENT_SOURCE_DIR}/src/.clang-tidy-ignore | sed -e '/_generated/g'`)

endif()

Expand Down
9 changes: 8 additions & 1 deletion cpp/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,4 +76,11 @@ build failures by running the following checks before submitting your pull reque

Note that the clang-tidy target may take a while to run. You might consider
running clang-tidy separately on the files you have added/changed before
invoking the make target to reduce iteration time.
invoking the make target to reduce iteration time. Also, it might generate warnings
that aren't valid. To avoid these you can use add a line comment `// NOLINT`. If
NOLINT doesn't suppress the warnings, you add the file in question to
the .clang-tidy-ignore file. This will allow `make check-clang-tidy` to pass in
travis-CI (but still surface the potential warnings in `make clang-tidy`). Ideally,
both of these options would be used rarely. Current known uses-cases whent hey are required:

* Parameterized tests in google test.
1 change: 1 addition & 0 deletions cpp/src/.clang-tidy-ignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ipc-adapter-test.cc
5 changes: 5 additions & 0 deletions cpp/src/arrow/array.cc
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
#include <cstdint>

#include "arrow/util/buffer.h"
#include "arrow/util/status.h"

namespace arrow {

Expand Down Expand Up @@ -47,6 +48,10 @@ bool Array::EqualsExact(const Array& other) const {
return true;
}

Status Array::Validate() const {
return Status::OK();
}

bool NullArray::Equals(const std::shared_ptr<Array>& arr) const {
if (this == arr.get()) { return true; }
if (Type::NA != arr->type_enum()) { return false; }
Expand Down
6 changes: 5 additions & 1 deletion cpp/src/arrow/array.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
namespace arrow {

class Buffer;
class Status;

// Immutable data array with some logical type and some length. Any memory is
// owned by the respective Buffer instance (or its parents).
Expand All @@ -39,7 +40,7 @@ class Array {
Array(const std::shared_ptr<DataType>& type, int32_t length, int32_t null_count = 0,
const std::shared_ptr<Buffer>& null_bitmap = nullptr);

virtual ~Array() {}
virtual ~Array() = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: I tried to find definitive documentation on when to use = default vs = 0 in C++11, if you have any pointers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is default == {}, and destructor = 0, should be used rarely. From: http://en.cppreference.com/w/cpp/language/destructor "A destructor may be declared pure virtual, for example in a base class which needs to be made abstract, but has no other suitable functions that could be declared pure virtual. Such destructor must have a definition, since all base class destructors are always called when the derived class is destroyed" Although the second sentence seems to contradict the first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to default purely for style (I likely should have done this as a separate PR and discussed), for some reason it reads more elegantly to me.


// Determine if a slot is null. For inner loops. Does *not* boundscheck
bool IsNull(int i) const {
Expand All @@ -58,6 +59,9 @@ class Array {

bool EqualsExact(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
// Determines if the array is internally consistent. Defaults to always
// returning Status::OK. This can be an expensive check.
virtual Status Validate() const;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pardon my dumb question, my understanding of Status just seeing its usage in the code I thought it's signaling either an operation succeeded or failed, but validate seems to imply it's either in a valid state or invalid state (like a bool). Why not just use bool? Or does Status can also encode error information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status can encode error information.

Copy link
Member

@wesm wesm Apr 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some work last year in Impala in which they analyzed the generated x86 instructions to see how to minimize the CPU cycles associated with verifying Status::OK: apache/impala@1afe728


protected:
std::shared_ptr<DataType> type_;
Expand Down
56 changes: 56 additions & 0 deletions cpp/src/arrow/builder.cc
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,25 @@

namespace arrow {

Status ArrayBuilder::AppendToBitmap(bool is_valid) {
if (length_ == capacity_) {
// If the capacity was not already a multiple of 2, do so here
// TODO(emkornfield) doubling isn't great default allocation practice
// see https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md
// fo discussion
RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
}
UnsafeAppendToBitmap(is_valid);
return Status::OK();
}

Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
RETURN_NOT_OK(Reserve(length));

UnsafeAppendToBitmap(valid_bytes, length);
return Status::OK();
}

Status ArrayBuilder::Init(int32_t capacity) {
capacity_ = capacity;
int32_t to_alloc = util::ceil_byte(capacity) / 8;
Expand All @@ -36,6 +55,7 @@ Status ArrayBuilder::Init(int32_t capacity) {
}

Status ArrayBuilder::Resize(int32_t new_bits) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way I was thinking it would be nice to make Resize and Init virtual methods, I think it would reduce code repetition and potential bugs for classes the don't have any reason to override Reserve. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, go ahead

if (!null_bitmap_) { return Init(new_bits); }
int32_t new_bytes = util::ceil_byte(new_bits) / 8;
int32_t old_bytes = null_bitmap_->size();
RETURN_NOT_OK(null_bitmap_->Resize(new_bytes));
Expand All @@ -56,10 +76,46 @@ Status ArrayBuilder::Advance(int32_t elements) {

Status ArrayBuilder::Reserve(int32_t elements) {
if (length_ + elements > capacity_) {
// TODO(emkornfield) power of 2 growth is potentially suboptimal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: should we do 1.5x growth everywhere (this is the folly approach IIRC -- is there more research on this subject?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is what folly uses. Seems like 1.5 edges out 2 on the very light survey done here: https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

Ideally, we would benchmark once we have some real data in place. In the absence of that 1.5 seems like a good default.

int32_t new_capacity = util::next_power2(length_ + elements);
return Resize(new_capacity);
}
return Status::OK();
}

Status ArrayBuilder::SetNotNull(int32_t length) {
RETURN_NOT_OK(Reserve(length));
UnsafeSetNotNull(length);
return Status::OK();
}

void ArrayBuilder::UnsafeAppendToBitmap(bool is_valid) {
if (is_valid) {
util::set_bit(null_bitmap_data_, length_);
} else {
++null_count_;
}
++length_;
}

void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
if (valid_bytes == nullptr) {
UnsafeSetNotNull(length);
return;
}
for (int32_t i = 0; i < length; ++i) {
// TODO(emkornfield) Optimize for large values of length?
UnsafeAppendToBitmap(valid_bytes[i] > 0);
}
}

void ArrayBuilder::UnsafeSetNotNull(int32_t length) {
const int32_t new_length = length + length_;
// TODO(emkornfield) Optimize for large values of length?
for (int32_t i = length_; i < new_length; ++i) {
util::set_bit(null_bitmap_data_, i);
}
length_ = new_length;
}

} // namespace arrow
46 changes: 37 additions & 9 deletions cpp/src/arrow/builder.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,10 @@ class PoolBuffer;

static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5;

// Base class for all data array builders
// Base class for all data array builders.
// This class provides a facilities for incrementally building the null bitmap
// (see Append methods) and as a side effect the current number of slots and
// the null count.
class ArrayBuilder {
public:
explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type)
Expand All @@ -46,7 +49,7 @@ class ArrayBuilder {
length_(0),
capacity_(0) {}

virtual ~ArrayBuilder() {}
virtual ~ArrayBuilder() = default;

// For nested types. Since the objects are owned by this class instance, we
// skip shared pointers and just return a raw pointer
Expand All @@ -58,14 +61,27 @@ class ArrayBuilder {
int32_t null_count() const { return null_count_; }
int32_t capacity() const { return capacity_; }

// Allocates requires memory at this level, but children need to be
// initialized independently
Status Init(int32_t capacity);
// Append to null bitmap
Status AppendToBitmap(bool is_valid);
// Vector append. Treat each zero byte as a null. If valid_bytes is null
// assume all of length bits are valid.
Status AppendToBitmap(const uint8_t* valid_bytes, int32_t length);
// Set the next length bits to not null (i.e. valid).
Status SetNotNull(int32_t length);

// Resizes the null_bitmap array
Status Resize(int32_t new_bits);
// Allocates initial capacity requirements for the builder. In most
// cases subclasses should override and call there parent classes
// method as well.
virtual Status Init(int32_t capacity);

Status Reserve(int32_t extra_bits);
// Resizes the null_bitmap array. In most
// cases subclasses should override and call there parent classes
// method as well.
virtual Status Resize(int32_t new_bits);

// Ensures there is enough space for adding the number of elements by checking
// capacity and calling Resize if necessary.
Status Reserve(int32_t elements);

// For cases where raw data was memcpy'd into the internal buffers, allows us
// to advance the length of the builder. It is your responsibility to use
Expand All @@ -75,7 +91,7 @@ class ArrayBuilder {
const std::shared_ptr<PoolBuffer>& null_bitmap() const { return null_bitmap_; }

// Creates new array object to hold the contents of the builder and transfers
// ownership of the data
// ownership of the data. This resets all variables on the builder.
virtual std::shared_ptr<Array> Finish() = 0;

const std::shared_ptr<DataType>& type() const { return type_; }
Expand All @@ -97,6 +113,18 @@ class ArrayBuilder {
// Child value array builders. These are owned by this class
std::vector<std::unique_ptr<ArrayBuilder>> children_;

//
// Unsafe operations (don't check capacity/don't resize)
//

// Append to null bitmap.
void UnsafeAppendToBitmap(bool is_valid);
// Vector append. Treat each zero byte as a nullzero. If valid_bytes is null
// assume all of length bits are valid.
void UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length);
// Set the next length bits to not null (i.e. valid).
void UnsafeSetNotNull(int32_t length);

private:
DISALLOW_COPY_AND_ASSIGN(ArrayBuilder);
};
Expand Down
Loading