Closed
69 commits
ff51e35
first pass at using substrait protobufs
bkietz Nov 11, 2021
906aacf
add conversion of types and a basic roundtrip test
bkietz Nov 12, 2021
5fc6c34
reorganize to engine/substrait/{proto}_internal.cc,h
bkietz Nov 15, 2021
151cf72
add basic literal conversions
bkietz Nov 15, 2021
1cf679b
don't rely on Datum::type in DataEq matcher
bkietz Nov 16, 2021
6dce17b
add substrait_gen_verify, allow configurable substrait repo and tag
bkietz Nov 18, 2021
53c7c7d
Expose NamedStruct<=>Schema serde
bkietz Nov 19, 2021
cd919f3
add if_else <-> IfThen conversion
bkietz Nov 22, 2021
c079359
rebase, catch up to changes in substrait
bkietz Nov 30, 2021
5c69fb1
finish catching up with substrait's new field references
bkietz Dec 1, 2021
2a77d28
add SubstraitFromJSON
bkietz Dec 1, 2021
a5c25de
port more tests to JSON
bkietz Dec 3, 2021
873ec23
add more DataEq matchers
bkietz Dec 6, 2021
05be909
use Date64 for substrait::date
bkietz Dec 6, 2021
7309964
add extension types for interval_*, support deeply nested struct fiel…
bkietz Dec 6, 2021
27af6b6
add basic sketch of ExtensionSet for tracking substrait extensions
bkietz Dec 9, 2021
37a62de
get a failing test for arrow::null
bkietz Dec 13, 2021
8397ee5
refactor extension types to index variations alongside types
bkietz Dec 15, 2021
61feb19
use an ExtensionSet-local registry
bkietz Dec 15, 2021
1eb4bc0
add ExtensionSet <-> Plan factories
bkietz Dec 17, 2021
e63da1e
pre merge stash
bkietz Jan 10, 2022
f37084a
post rebase cleanup
bkietz Jan 10, 2022
2725ed7
Changes needed to get branch to compile
jvanstraten Jan 11, 2022
11e7f3f
Fix uninitialized pointers
jvanstraten Jan 11, 2022
96727e1
Fix FieldRef/StructField order
jvanstraten Jan 11, 2022
07b259c
gcc: use a pointer to the properties tuple instead of a reference
bkietz Jan 11, 2022
885836a
advance substrait version
bkietz Jan 11, 2022
0b94f3a
make JSON utils public, add CheckMessagesEquivalent()
bkietz Jan 11, 2022
98c74a8
Revert now unnecessary part of 9255fb6
jvanstraten Jan 12, 2022
6fd2e73
Support nested StructFields
jvanstraten Jan 12, 2022
0034651
Support struct_field compute function
jvanstraten Jan 12, 2022
b5e6fa4
Use ReferenceSegment.child where possible when emitting Substrait
jvanstraten Jan 12, 2022
e7184f5
Use CheckMessagesEquivalent() for test
jvanstraten Jan 12, 2022
bf93511
Fix compilation with googletest 1.11
jvanstraten Jan 12, 2022
2573bc5
add nullable field roundtrip test
bkietz Jan 12, 2022
095560f
Use lowercase nullptr in cc files
jvanstraten Jan 13, 2022
24517ff
Remove redundant else block
jvanstraten Jan 13, 2022
c4d9877
Fix clang-format'ing
jvanstraten Jan 13, 2022
f2e0e71
Simplify Fingerprintable constructor
jvanstraten Jan 13, 2022
b09d372
Simplify unique_ptr moves and casts
jvanstraten Jan 13, 2022
1557f4f
Minor fixes in suggested changes
jvanstraten Jan 13, 2022
839826b
Add tests for mixed struct references and expressions
jvanstraten Jan 13, 2022
d12545f
clean up internal::
bkietz Jan 13, 2022
9dd9c70
revert Fingerprintable change
bkietz Jan 13, 2022
9a569c9
add a simple example of substrait consumption
bkietz Jan 14, 2022
ee32bb5
add sketch of Relation conversion
bkietz Jan 14, 2022
cf33bd1
WIP on case_when support
jvanstraten Jan 13, 2022
2fc0123
Fully implement case_when(make_struct(...), ...)
jvanstraten Jan 14, 2022
9468920
Simplify ReferenceSegment manipulation functions
jvanstraten Jan 18, 2022
4da0939
add test for ReadRel conversion
bkietz Jan 19, 2022
1487b37
add function extensions to ExtensionSet
bkietz Jan 20, 2022
d237377
Add a test for extraction of an ExtensionSet from a Plan
bkietz Jan 20, 2022
5e0d6e5
add a roundtrip test for calling an extension function
bkietz Jan 20, 2022
cd22ef0
repair status_test::MatcherExplanations
bkietz Jan 21, 2022
1dc4c9d
removing old generated extension files
bkietz Jan 21, 2022
ed5b0d5
unity: ensure globals are unique within a TU
bkietz Jan 21, 2022
5d61724
ensure generated files are also excluded from lint_cpp_cli
bkietz Jan 21, 2022
37b5673
substrait consumer api cleanup
bkietz Jan 21, 2022
5b362f9
ensure generated files are also excluded from rat
bkietz Jan 21, 2022
b1a9bba
put actual json in engine_substrait_consumption.cc
bkietz Jan 21, 2022
b6499ae
msvc: suppress C4251 (needs dll-interface)
bkietz Jan 21, 2022
1473dc2
remove duplicate ARROW_ENGINE option
bkietz Jan 21, 2022
107c5b7
ensure ARROW_DATASET is available if ARROW_ENGINE=on
bkietz Jan 21, 2022
5012be6
try to localize suppressions, try defining LIBPROTOBUF_EXPORTS
bkietz Jan 21, 2022
2c73334
run cmake-format
bkietz Jan 21, 2022
65c2ee8
msvc: just suppress C4251, more int/size_t fixes
bkietz Jan 21, 2022
6d8ebc7
msvc: one more int/size_t fix
bkietz Jan 21, 2022
adfe196
msvc: one more int/size_t fix
bkietz Jan 21, 2022
0a965c5
use libprotobuf 3.19
bkietz Jan 21, 2022
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -352,6 +352,7 @@ endif()

if(ARROW_ENGINE)
set(ARROW_COMPUTE ON)
+set(ARROW_DATASET ON)
endif()

if(ARROW_SKYHOOK)
3 changes: 2 additions & 1 deletion cpp/build-support/lint_cpp_cli.py
@@ -89,7 +89,8 @@ def lint_file(path):
jni/
test
internal
-_generated''')
+_generated
+generated/substrait/''')


def lint_files():
1 change: 1 addition & 0 deletions cpp/build-support/lint_exclusions.txt
@@ -1,4 +1,5 @@
*_generated*
+*generated/substrait/*.pb.*
*.grpc.fb.*
*arrowExports.cpp*
*parquet_constants.*
9 changes: 8 additions & 1 deletion cpp/cmake_modules/DefineOptions.cmake
@@ -225,7 +225,7 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")

define_option(ARROW_DATASET "Build the Arrow Dataset Modules" OFF)

-define_option(ARROW_ENGINE "Build the Arrow Execution Engine" OFF)
+define_option(ARROW_ENGINE "Build the Arrow Query Engine Module" OFF)

define_option(ARROW_FILESYSTEM "Build the Arrow Filesystem Layer" OFF)

@@ -478,6 +478,13 @@ advised that if this is enabled 'install' will fail silently on components;\
that have not been built"
OFF)

+set(ARROW_SUBSTRAIT_REPO_AND_TAG_DEFAULT
+"https://github.com/substrait-io/substrait e1b4c04a1b518912f4c4065b16a1b2c0ac8e14cf"
+)
+define_option_string(ARROW_SUBSTRAIT_REPO_AND_TAG
+"Custom 'repository_url tag' for generating substrait accessors"
+"${ARROW_SUBSTRAIT_REPO_AND_TAG_DEFAULT}")

option(ARROW_BUILD_CONFIG_SUMMARY_JSON "Summarize build configuration in a JSON file"
ON)
endif()
88 changes: 88 additions & 0 deletions cpp/cmake_modules/FindArrowEngine.cmake
@@ -0,0 +1,88 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# - Find Arrow Engine (arrow/engine/api.h, libarrow_engine.a, libarrow_engine.so)
#
# This module requires Arrow from which it uses
# arrow_find_package()
#
# This module defines
# ARROW_ENGINE_FOUND, whether Arrow Engine has been found
# ARROW_ENGINE_IMPORT_LIB,
# path to libarrow_engine's import library (Windows only)
# ARROW_ENGINE_INCLUDE_DIR, directory containing headers
# ARROW_ENGINE_LIB_DIR, directory containing Arrow Engine libraries
# ARROW_ENGINE_SHARED_LIB, path to libarrow_engine's shared library
# ARROW_ENGINE_STATIC_LIB, path to libarrow_engine.a

if(DEFINED ARROW_ENGINE_FOUND)
return()
endif()

set(find_package_arguments)
if(${CMAKE_FIND_PACKAGE_NAME}_FIND_VERSION)
list(APPEND find_package_arguments "${${CMAKE_FIND_PACKAGE_NAME}_FIND_VERSION}")
endif()
if(${CMAKE_FIND_PACKAGE_NAME}_FIND_REQUIRED)
list(APPEND find_package_arguments REQUIRED)
endif()
if(${CMAKE_FIND_PACKAGE_NAME}_FIND_QUIETLY)
list(APPEND find_package_arguments QUIET)
endif()
find_package(Arrow ${find_package_arguments})
find_package(Parquet ${find_package_arguments})

if(ARROW_FOUND AND PARQUET_FOUND)
arrow_find_package(ARROW_ENGINE
"${ARROW_HOME}"
arrow_engine
arrow/engine/api.h
ArrowEngine
arrow-engine)
if(NOT ARROW_ENGINE_VERSION)
set(ARROW_ENGINE_VERSION "${ARROW_VERSION}")
endif()
endif()

if("${ARROW_ENGINE_VERSION}" VERSION_EQUAL "${ARROW_VERSION}")
set(ARROW_ENGINE_VERSION_MATCH TRUE)
else()
set(ARROW_ENGINE_VERSION_MATCH FALSE)
endif()

mark_as_advanced(ARROW_ENGINE_IMPORT_LIB
ARROW_ENGINE_INCLUDE_DIR
ARROW_ENGINE_LIBS
ARROW_ENGINE_LIB_DIR
ARROW_ENGINE_SHARED_IMP_LIB
ARROW_ENGINE_SHARED_LIB
ARROW_ENGINE_STATIC_LIB
ARROW_ENGINE_VERSION
ARROW_ENGINE_VERSION_MATCH)

find_package_handle_standard_args(
ArrowEngine
REQUIRED_VARS ARROW_ENGINE_INCLUDE_DIR ARROW_ENGINE_LIB_DIR ARROW_ENGINE_VERSION_MATCH
VERSION_VAR ARROW_ENGINE_VERSION)
set(ARROW_ENGINE_FOUND ${ArrowEngine_FOUND})

if(ArrowEngine_FOUND AND NOT ArrowEngine_FIND_QUIETLY)
message(STATUS "Found the Arrow Engine by ${ARROW_ENGINE_FIND_APPROACH}")
message(STATUS "Found the Arrow Engine shared library: ${ARROW_ENGINE_SHARED_LIB}")
message(STATUS "Found the Arrow Engine import library: ${ARROW_ENGINE_IMPORT_LIB}")
message(STATUS "Found the Arrow Engine static library: ${ARROW_ENGINE_STATIC_LIB}")
endif()
3 changes: 2 additions & 1 deletion cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -309,7 +309,8 @@ endif()

if(ARROW_ORC
OR ARROW_FLIGHT
-OR ARROW_GANDIVA)
+OR ARROW_GANDIVA
+OR ARROW_ENGINE)
set(ARROW_WITH_PROTOBUF ON)
endif()

4 changes: 4 additions & 0 deletions cpp/examples/arrow/CMakeLists.txt
@@ -21,6 +21,10 @@ if(ARROW_COMPUTE)
add_arrow_example(compute_register_example)
endif()

+if(ARROW_ENGINE)
+add_arrow_example(engine_substrait_consumption EXTRA_LINK_LIBS arrow_engine_shared)
+endif()

if(ARROW_COMPUTE AND ARROW_CSV)
add_arrow_example(compute_and_write_csv_example)
endif()
187 changes: 187 additions & 0 deletions cpp/examples/arrow/engine_substrait_consumption.cc
@@ -0,0 +1,187 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/options.h>
#include <arrow/engine/substrait/serde.h>

#include <cstdlib>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace eng = ::arrow::engine;
namespace cp = ::arrow::compute;

#define ABORT_ON_FAILURE(expr) \
do { \
arrow::Status status_ = (expr); \
if (!status_.ok()) { \
std::cerr << status_.message() << std::endl; \
abort(); \
} \
} while (0)

arrow::Future<std::shared_ptr<arrow::Buffer>> GetSubstraitFromServer();

class IgnoringConsumer : public cp::SinkNodeConsumer {
public:
explicit IgnoringConsumer(size_t tag) : tag_{tag} {}

arrow::Status Consume(cp::ExecBatch batch) override {
// Consume a batch of data
// (just print its row count to stdout)
std::cout << "-" << tag_ << " consumed " << batch.length << " rows" << std::endl;
return arrow::Status::OK();
}

arrow::Future<> Finish() override {
// Signal to the consumer that the last batch has been delivered
// (we don't do any real work in this consumer so mark it finished immediately)
//
// The returned future should only finish when all outstanding tasks have completed
// (after this method is called Consume is guaranteed not to be called again)
std::cout << "-" << tag_ << " finished" << std::endl;
return arrow::Future<>::MakeFinished();
}

private:
size_t tag_;
};

int main(int argc, char** argv) {
// Plans arrive at the consumer serialized in a substrait-formatted Buffer
auto maybe_serialized_plan = GetSubstraitFromServer().result();
ABORT_ON_FAILURE(maybe_serialized_plan.status());
std::shared_ptr<arrow::Buffer> serialized_plan =
std::move(maybe_serialized_plan).ValueOrDie();

// Print the received plan to stdout as JSON
arrow::Result<std::string> maybe_plan_json =
eng::internal::SubstraitToJSON("Plan", *serialized_plan);
ABORT_ON_FAILURE(maybe_plan_json.status());
std::cout << std::string(50, '#') << " received substrait::Plan:" << std::endl;
std::cout << maybe_plan_json.ValueOrDie() << std::endl;

// Deserializing a plan requires a factory for consumers: each time a sink node is
// deserialized, a consumer is constructed into which its batches will be piped.
std::vector<std::shared_ptr<cp::SinkNodeConsumer>> consumers;
std::function<std::shared_ptr<cp::SinkNodeConsumer>()> consumer_factory = [&] {
// All batches produced by the plan will be fed into IgnoringConsumers:
auto tag = consumers.size();
consumers.emplace_back(new IgnoringConsumer{tag});
return consumers.back();
};

// NOTE Although most of the Deserialize functions require a const ExtensionSet& to
// resolve extension references, DeserializePlan constructs that ExtensionSet from the
// Plan itself. (It should be an optional output later.) In particular, neither the
// ExtensionSet nor the serialized plan needs to be kept alive: none of the arrow::
// objects in the output will contain references to memory owned by either.
auto maybe_decls = eng::DeserializePlan(*serialized_plan, consumer_factory);
ABORT_ON_FAILURE(maybe_decls.status());
std::vector<cp::Declaration> decls = std::move(maybe_decls).ValueOrDie();

// It's safe to drop the serialized plan; we don't leave references to its memory
serialized_plan.reset();

// Construct an empty plan (note: configure Function registry and ThreadPool here)
auto maybe_plan = cp::ExecPlan::Make();
ABORT_ON_FAILURE(maybe_plan.status());
std::shared_ptr<cp::ExecPlan> plan = std::move(maybe_plan).ValueOrDie();

for (const cp::Declaration& decl : decls) {
// Add decl to plan (note: configure ExecNode registry here)
ABORT_ON_FAILURE(decl.AddToPlan(plan.get()).status());
}

// Validate the plan and print it to stdout
ABORT_ON_FAILURE(plan->Validate());
std::cout << std::string(50, '#') << " produced arrow::ExecPlan:" << std::endl;
std::cout << plan->ToString() << std::endl;

// Start the plan...
std::cout << std::string(50, '#') << " consuming batches:" << std::endl;
ABORT_ON_FAILURE(plan->StartProducing());

// ... and wait for it to finish
ABORT_ON_FAILURE(plan->finished().status());
return EXIT_SUCCESS;
}

arrow::Future<std::shared_ptr<arrow::Buffer>> GetSubstraitFromServer() {
// Emulate server interaction by parsing hard coded JSON
return eng::internal::SubstraitFromJSON("Plan", R"({
"relations": [
{"rel": {
"read": {
"base_schema": {
"struct": {
"types": [ {"i64": {}}, {"bool": {}} ]
},
"names": ["i", "b"]
},
"filter": {
"selection": {
"directReference": {
"structField": {
"field": 1
}
}
}
},
"local_files": {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
},
{
"uri_file": "file:///tmp/dat2.parquet",
"format": "FILE_FORMAT_PARQUET"
}
]
}
}
}}
],
"extension_uris": [
{
"extension_uri_anchor": 7,
"uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
}
],
"extensions": [
{"extension_type": {
"extension_uri_reference": 7,
"type_anchor": 42,
"name": "null"
}},
{"extension_type_variation": {
"extension_uri_reference": 7,
"type_variation_anchor": 23,
"name": "u8"
}},
{"extension_function": {
"extension_uri_reference": 7,
"function_anchor": 42,
"name": "add"
}}
]
})");
}
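
The consumer factory used in engine_substrait_consumption.cc (one consumer constructed per deserialized sink node, each tagged with its creation index) can be illustrated without Arrow. The following is a minimal sketch under stated assumptions: `Consumer`, `ConsumerFactory`, `MakeTaggingFactory`, and `MakeSinks` are hypothetical names standing in for `cp::SinkNodeConsumer`, the `std::function` factory, and plan deserialization; none of them are Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for cp::SinkNodeConsumer (not the Arrow class).
struct Consumer {
  explicit Consumer(std::size_t tag_in) : tag(tag_in) {}
  void Consume(std::size_t num_rows) { rows_seen += num_rows; }

  std::size_t tag;
  std::size_t rows_seen = 0;
};

using ConsumerFactory = std::function<std::shared_ptr<Consumer>()>;

// Returns a factory that tags each new consumer with its creation index and
// records it in `registry`, which must outlive the returned factory.
ConsumerFactory MakeTaggingFactory(std::vector<std::shared_ptr<Consumer>>* registry) {
  return [registry] {
    const std::size_t tag = registry->size();
    registry->push_back(std::make_shared<Consumer>(tag));
    return registry->back();
  };
}

// Hypothetical stand-in for deserialization: asks the factory for one
// consumer per sink node encountered.
std::vector<std::shared_ptr<Consumer>> MakeSinks(std::size_t num_sinks,
                                                 const ConsumerFactory& factory) {
  assert(static_cast<bool>(factory));
  std::vector<std::shared_ptr<Consumer>> sinks;
  for (std::size_t i = 0; i < num_sinks; ++i) {
    sinks.push_back(factory());
  }
  return sinks;
}
```

Because the factory hands out `shared_ptr`s that are also stored in the registry, the caller can inspect every consumer after the plan finishes, which is presumably why the example keeps its `consumers` vector alive in `main`.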
4 changes: 4 additions & 0 deletions cpp/src/arrow/CMakeLists.txt
@@ -722,6 +722,10 @@ if(ARROW_COMPUTE)
add_subdirectory(compute)
endif()

+if(ARROW_ENGINE)
Member

ARROW_ENGINE depends on ARROW_COMPUTE. Can we set it somewhere?

Perhaps consider placing it under the compute subdirectory?

Member

I believe ARROW_ENGINE may also someday depend on datasets, IPC, Parquet, and maybe even Flight. For example, a Substrait plan will generally start with a scan node (datasets), the engine may need to use spillover (IPC / Parquet), and we might want to send data to or receive data from Flight nodes.

Some of this we could probably avoid with more indirection (e.g. the Substrait consumer defines a "table provider" and the user supplies the "datasets table provider" to link the two modules), but to start with it might be easier to just do the simple thing.

Either way, I think we will eventually want a standalone engine module that isn't just a child of compute, so I'm kind of in favor of it being a peer (and not a child) of compute.

See: ARROW-15238 (which this PR satisfies)
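
To make the "table provider" indirection mentioned in this comment concrete, here is a minimal sketch, assuming the consumer owns a small abstract interface and another module supplies the implementation. All names (`TableProvider`, `InMemoryTableProvider`, `CountReadableColumns`) are hypothetical and do not exist in Arrow; the PR itself takes the simpler route of depending on datasets directly.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface owned by the Substrait consumer. It knows nothing
// about datasets, Parquet, or any other storage module.
class TableProvider {
 public:
  virtual ~TableProvider() = default;
  // Returns the column names of the table at `uri`, or an empty vector if
  // the provider does not know that table.
  virtual std::vector<std::string> ColumnNames(const std::string& uri) const = 0;
};

// Hypothetical implementation that a separate module (for example a
// datasets-backed one) could supply and the user could plug in.
class InMemoryTableProvider : public TableProvider {
 public:
  void Register(std::string uri, std::vector<std::string> columns) {
    assert(!uri.empty());
    tables_[std::move(uri)] = std::move(columns);
  }

  std::vector<std::string> ColumnNames(const std::string& uri) const override {
    auto it = tables_.find(uri);
    return it == tables_.end() ? std::vector<std::string>{} : it->second;
  }

 private:
  std::map<std::string, std::vector<std::string>> tables_;
};

// Consumer-side code depends only on the interface, so the consumer library
// would not need to link against the module that implements it.
std::size_t CountReadableColumns(const TableProvider& provider,
                                 const std::vector<std::string>& uris) {
  std::size_t total = 0;
  for (const auto& uri : uris) {
    total += provider.ColumnNames(uri).size();
  }
  return total;
}
```

The trade-off is the one the comment describes: the interface removes the build-time dependency but adds a layer every user must wire up, which is why starting with a direct dependency may be easier.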

Member Author

ARROW_ENGINE does now depend on datasets: a scan node is used to wrap substrait::ReadRels.

+add_subdirectory(engine)
+endif()

if(ARROW_CUDA)
add_subdirectory(gpu)
endif()
2 changes: 2 additions & 0 deletions cpp/src/arrow/array/array_base.cc
@@ -282,6 +282,8 @@ std::string Array::ToString() const {
return ss.str();
}

+void PrintTo(const Array& x, std::ostream* os) { *os << x.ToString(); }

Result<std::shared_ptr<Array>> Array::View(
const std::shared_ptr<DataType>& out_type) const {
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ArrayData> result,
7 changes: 5 additions & 2 deletions cpp/src/arrow/array/array_base.h
@@ -187,10 +187,11 @@ class ARROW_EXPORT Array {
Status ValidateFull() const;

protected:
-Array() : null_bitmap_data_(NULLPTR) {}
+Array() = default;
+ARROW_DEFAULT_MOVE_AND_ASSIGN(Array);

std::shared_ptr<ArrayData> data_;
-const uint8_t* null_bitmap_data_;
+const uint8_t* null_bitmap_data_ = NULLPTR;

/// Protected method for constructors
void SetData(const std::shared_ptr<ArrayData>& data) {
@@ -204,6 +205,8 @@ class ARROW_EXPORT Array {

private:
ARROW_DISALLOW_COPY_AND_ASSIGN(Array);

+ARROW_EXPORT friend void PrintTo(const Array& x, std::ostream* os);
};

static inline std::ostream& operator<<(std::ostream& os, const Array& x) {
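
The array_base.h hunks make two small changes: the default constructor becomes `= default` with a default member initializer, and a `PrintTo` friend is added, which is the hook googletest looks up by argument-dependent lookup when printing values in failure messages. The sketch below shows both mechanisms in isolation, assuming a hypothetical `demo::Value` class in place of `arrow::Array` and a hypothetical `Describe` helper that mimics the unqualified `PrintTo` call a test framework would make. Unlike the PR, which defines `PrintTo` in the .cc file, the friend here is defined inline for brevity.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

namespace demo {

// Hypothetical stand-in for arrow::Array.
class Value {
 public:
  // Defaulted constructor: v_ takes its default member initializer below,
  // so no constructor body is needed to avoid an uninitialized member.
  Value() = default;
  explicit Value(int v) : v_(v) {}

  std::string ToString() const { return "Value(" + std::to_string(v_) + ")"; }

 private:
  int v_ = 0;

  // Friend function found through argument-dependent lookup on Value.
  friend void PrintTo(const Value& x, std::ostream* os) { *os << x.ToString(); }
};

}  // namespace demo

// Mimics a test framework rendering a value: the unqualified call to
// PrintTo is resolved via ADL in the namespace of T.
template <typename T>
std::string Describe(const T& value) {
  std::ostringstream os;
  PrintTo(value, &os);
  return os.str();
}
```

With this in place, `Describe(demo::Value(7))` yields `"Value(7)"` and `Describe(demo::Value())` yields `"Value(0)"`.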