Skip to content

[C++] A way to assert that arrays of a certain type have the validity of the values defined by the validity bitmap #41114

@felipecrv

Description

@felipecrv

Describe the enhancement requested

Certain types don't have the validity defined by the bitmap buffer's presence and bit values:

  • NA
  • SPARSE_UNION
  • DENSE_UNION
  • RUN_END_ENCODED

This means that recently inline bool ArrayData::IsValid() had to grow:

bool IsValid(int64_t i) const {
if (buffers[0] != NULLPTR) {
return bit_util::GetBit(buffers[0]->data(), i + offset);
}
const auto type = this->type->id();
if (type == Type::SPARSE_UNION) {
return !internal::IsNullSparseUnion(*this, i);
}
if (type == Type::DENSE_UNION) {
return !internal::IsNullDenseUnion(*this, i);
}
if (type == Type::RUN_END_ENCODED) {
return !internal::IsNullRunEndEncoded(*this, i);
}
return null_count.load() != length;
}

Many type-specific loops in Arrow don't use IsValid and check the bitmap directly instead (good to reduce the branching in hot loops or to do lower-level operations with the bitmaps).

To be future-proof, these loops would have to migrate to using IsValid, but since doing that would be overkill for most cases (unions and REEs are rare), I propose the creation of a function close to the special validity checking functions that checks if a type allows for validity checks to be performed solely through the bitmap:

constexpr bool is_validity_defined_by_bitmap(Type::type id) {
  switch (id) {
    case Type::NA:
    case Type::SPARSE_UNION:
    case Type::DENSE_UNION:
    case Type::RUN_END_ENCODED:
      return false;
    default:
      return true;
  }
}

This means that if a template or function that only looks at the validity bitmap starts being used for these special types we can get static_assert, assert and DCHECK violations.

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions