ARROW 16968: [C++] Expand Python-UDF support to Arrow Substrait #13500

rtpsw · 2022-07-03T08:50:49Z

See https://issues.apache.org/jira/browse/ARROW-16968

github-actions · 2022-07-03T08:51:10Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


rtpsw · 2022-07-03T08:58:54Z

This PR is close to ready and can be usefully reviewed now. It is not entirely ready for these reasons:

There is currently a commented-out code block for UDF-Substrait parsing, pending agreement with the Substrait people on how to represent UDFs in a Substrait plan. I expect the future agreement will not lead to significant changes outside this block of code.
I have PyArrow UDF unit tests that I have not included yet because of the same pending agreement.

Also, please ignore all recent commits except the last one.

vibhatha · 2022-07-04T02:43:25Z

cpp/src/arrow/engine/substrait/serde.cc

+  /*
+  for (const auto& ext : plan.extensions()) {
+    switch (ext.mapping_type_case()) {
+      case substrait::extensions::SimpleExtensionDeclaration::kExtensionFunction: {
+        const auto& fn = ext.extension_function();
+        if (fn.has_udf()) {
+          const auto& udf = fn.udf();
+          const auto& in_types = udf.input_types();
+          int size = in_types.size();
+          std::vector<std::pair<std::shared_ptr<DataType>, bool>> input_types;
+          for (int i=0; i<size; i++) {
+            ARROW_ASSIGN_OR_RAISE(auto input_type, FromProto(in_types.Get(i), ext_set));
+            input_types.push_back(std::move(input_type));
+          }
+          ARROW_ASSIGN_OR_RAISE(auto output_type, FromProto(udf.output_type(), ext_set));
+          decls.push_back(std::move(UdfDeclaration{
+            fn.name(),
+            udf.code(),
+            udf.summary(),
+            udf.description(),
+            std::move(input_types),
+            std::move(output_type),
+          }));
+        }
+        break;
+      }
+      default: {
+        break;
+      }
+    }
+  }
+  */


Is this not used?

This is the commented-out code-block explained here.

vibhatha · 2022-07-04T02:45:00Z

python/pyarrow/tests/test_udf.py


 import pytest

+import numpy as np


Why is this important?

I think this is a leftover from my removing the unit tests, as explained here. I can remove this line but may need it back when the unit tests are included.

vibhatha · 2022-07-04T02:45:46Z

python/pyarrow/public-api.pxi

+    if pyarrow_is_extension_id_registry(registry):
+        reg = <ExtensionIdRegistry>(registry)
+        return reg.sp_registry
+


nit: we can get rid of this empty line.

vibhatha · 2022-07-04T02:46:01Z

python/pyarrow/public-api.pxi

+    if pyarrow_is_function_registry(registry):
+        reg = <BaseFunctionRegistry>(registry)
+        return reg.registry
+


nit: Maybe remove this empty line?

vibhatha · 2022-07-04T02:47:50Z

python/pyarrow/tests/test_substrait.py

 import pyarrow as pa
 from pyarrow.lib import tobytes
 from pyarrow.lib import ArrowInvalid
+from pyarrow.substrait import make_extension_id_registry


nit: This should probably come after the try-catch block for checking non-erroneous substrait import.

It's not useful at all, is it? Can just call substrait.make_extension_id_registry?

vibhatha · 2022-07-04T02:48:55Z

python/pyarrow/lib.pxd

 cdef public shared_ptr[CRecordBatch] pyarrow_unwrap_batch(object batch)
 cdef public shared_ptr[CTable] pyarrow_unwrap_table(object table)
+
+cdef public CFunctionRegistry* pyarrow_unwrap_function_registry(object registry)


should we consider using a shared_ptr instead?
cc @lidavidm

This crossed my mind while I was working on the code of this PR. I also think a shared_ptr would make things easier, but it would require changes across Py/Arrow that are likely better handled in a dedicated PR.

vibhatha · 2022-07-04T02:51:35Z

python/pyarrow/_substrait.pyx

    ----------
    plan : Buffer
        The serialized Substrait plan to execute.
+    extid_registry : ExtensionIdRegistry


nit: Maybe ext_id_registry?

vibhatha · 2022-07-04T02:54:14Z

python/pyarrow/_substrait.pyx

+from pyarrow._compute cimport FunctionRegistry
+
+
+from pyarrow._exec_plan cimport is_supported_execplan_output_type, execplan
+from pyarrow._compute import make_function_registry


nit: order of imports?

vibhatha · 2022-07-04T02:56:42Z

@rtpsw I went through the PR at a very high level. I will look into this again. Wouldn't it be better to wait until the Substrait-related factors are finalized?

cc @westonpace

rtpsw · 2022-07-04T09:51:53Z

@rtpsw I went through the PR at a very high level. I will look into this again. Wouldn't it be better to wait until the Substrait-related factors are finalized?

Despite the Substrait agreement being pending, I posted this PR due to a chicken-and-egg problem: if the overall approach in this PR is rejected, then the Substrait agreement I'm currently pursuing could be irrelevant. At this time, I'm seeking only an approval of the approach here rather than a detailed review. Assuming this approval is given, I'll work to reach the Substrait agreement and get back to this PR.

rtpsw · 2022-07-06T11:55:36Z

@westonpace, could you quickly check to confirm the general approach in this PR is acceptable, for the purpose noted here?

westonpace · 2022-07-06T21:43:34Z

The approach, if I'm understanding correctly, is to use C++ to make two passes through the plan (or maybe it's one pass). The first pass gets all the UDFs out of the plan. Pyarrow then unpickles and registers those UDFs. The second actually consumes the plan, using a registry that contains those unpickled functions.

This wouldn't be my first approach. I think I'd prefer adding another callback like the consumer_factory for UDF handling. This would make it easier to handle situations where there are alternative UDF handlers. Or, for example, a C++ or R user that still wants to be able to run python UDFs. However, I'm not opposed to this approach. The end pyarrow interface to the user is still just "substrait in->data out" so if we wanted to move to a different approach in the future that would be fine.
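
The two-pass flow described above can be illustrated with a minimal, self-contained sketch. It does not use the PR's actual API: the plan is a plain dict, the registry is a plain dict, and `UdfRecord`, `extract_udfs`, `register_udfs`, and `run_plan` are hypothetical names chosen for illustration. It also assumes the standard-library `pickle` module and an importable callable (`operator.add`) so that the example runs without PyArrow.

```python
import operator
import pickle
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UdfRecord:
    """Hypothetical stand-in for the per-UDF data carried by a plan."""
    name: str
    code: bytes  # pickled callable
    input_types: List[str]
    output_type: str


def extract_udfs(plan: dict) -> List[UdfRecord]:
    """First pass: collect UDF declarations without executing the plan."""
    return [UdfRecord(**decl) for decl in plan.get("udfs", [])]


def register_udfs(records: List[UdfRecord], registry: Dict[str, Callable]) -> None:
    """Unpickle each UDF and register it under its declared name."""
    for record in records:
        registry[record.name] = pickle.loads(record.code)


def run_plan(plan: dict, registry: Dict[str, Callable]) -> list:
    """Second pass: consume the plan, resolving functions via the registry."""
    return [registry[call["function"]](*call["args"]) for call in plan["calls"]]


# A toy "plan" (assumption: the real plan is a serialized Substrait message).
plan = {
    "udfs": [
        {
            "name": "my_add",
            "code": pickle.dumps(operator.add),
            "input_types": ["int64", "int64"],
            "output_type": "int64",
        }
    ],
    "calls": [{"function": "my_add", "args": [2, 3]}],
}

registry: Dict[str, Callable] = {}
register_udfs(extract_udfs(plan), registry)  # pass 1, then registration
result = run_plan(plan, registry)            # pass 2
```

In this sketch the registry passed to the second pass is scoped to the caller, which mirrors the idea of handing a dedicated function registry to the plan consumer rather than mutating a global one.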

rtpsw · 2022-07-06T22:31:17Z

The approach, if I'm understanding correctly, is to use C++ to make two passes through the plan (or maybe it's one pass). The first pass gets all the UDFs out of the plan. Pyarrow then unpickles and registers those UDFs. The second actually consumes the plan, using a registry that contains those unpickled functions.

This is a fair description. For the purpose of alignment with my corresponding Substrait proposal, could you confirm the data associated with each UDF is appropriate/acceptable? If so, I'll ensure it gets expressed in the Substrait plan, even if it ends up being organized differently there.

This wouldn't be my first approach. I think I'd prefer adding another callback like the consumer_factory for UDF handling. This would make it easier to handle situations where there are alternative UDF handlers. Or, for example, a C++ or R user that still wants to be able to run python UDFs. However, I'm not opposed to this approach. The end pyarrow interface to the user is still just "substrait in->data out" so if we wanted to move to a different approach in the future that would be fine.

The current approach does not block using a UDF handler. I think the only real difference is that in my approach the data for all UDFs is packed together and crosses the C++/Python boundary once. Given this data, one can write a loop that calls any UDF handler on any of the UDF records, with optional record filtering and other such enhancements if needed. This would be an alternative to the current behavior you described as "Pyarrow then unpickles and registers those UDFs"; I don't think this needs to be implemented right away, but I'm open to arguments in favor.
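
As a rough sketch of the loop mentioned above (offering each UDF record to pluggable handlers, with optional record filtering), the following uses only plain Python. All names are hypothetical, and the `language` field is an assumption added for illustration; it is not part of the UDF data discussed in this PR.

```python
from typing import Callable, Dict, Iterable, List, Optional

UdfRecord = Dict[str, object]
UdfHandler = Callable[[UdfRecord], bool]  # returns True if it handled the record


def dispatch_udfs(
    records: Iterable[UdfRecord],
    handlers: List[UdfHandler],
    keep: Optional[Callable[[UdfRecord], bool]] = None,
) -> List[str]:
    """Offer each kept record to the handlers in order.

    Returns the names of records that no handler accepted.
    """
    unhandled = []
    for record in records:
        if keep is not None and not keep(record):
            continue  # filtered out before any handler sees it
        if not any(handler(record) for handler in handlers):
            unhandled.append(record["name"])
    return unhandled


registered: Dict[str, object] = {}


def python_handler(record: UdfRecord) -> bool:
    """Toy handler: accepts only records tagged as Python UDFs."""
    if record["language"] != "python":
        return False
    registered[record["name"]] = record["code"]
    return True


records = [
    {"name": "f", "language": "python", "code": b"<pickled f>"},
    {"name": "g", "language": "r", "code": b"<serialized g>"},
    {"name": "h", "language": "python", "code": b"<pickled h>"},
]

# Skip record "h" via the filter; "g" has no matching handler.
unhandled = dispatch_udfs(records, [python_handler], keep=lambda r: r["name"] != "h")
```

Because all records cross the language boundary together, the choice of handlers and filters stays entirely on the receiving side.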

westonpace · 2022-07-07T06:40:53Z

For the purposes of a consumer (Acero) I would say summary and description are superfluous. I can see them being useful to other components (e.g. a plan visualizer) though.

I don't think we need to be perfectly aligned with Substrait to start with so this is fine. New features should start as extensions and move into Substrait once there is some proven usage.

rtpsw · 2022-07-07T10:03:34Z

Great, I'll shift my focus to the Substrait PR. Feel free to review at a lower priority until I notify here that I'm done with that PR.

pitrou

I'm not sure I understand whether this PR is finished or not (I see some code commented out). If not, can you please make it draft?

I am not an expert in this code, but here are some comments.

pitrou · 2022-07-11T15:04:52Z

cpp/src/arrow/compute/registry_util.cc

+namespace compute {
+
+std::unique_ptr<FunctionRegistry> MakeFunctionRegistry() {
+  return FunctionRegistry::Make(GetFunctionRegistry());


What is the point of this mostly trivial function? Why not let the user call FunctionRegistry::Make directly?

pitrou · 2022-07-11T15:05:47Z

cpp/src/arrow/compute/exec/options.h

  std::shared_ptr<Table>* output_table;
 };

-/// @}


You shouldn't remove this; it matches the opening brace in \addtogroup execnode-options above.

pitrou · 2022-07-11T15:06:14Z

cpp/src/arrow/engine/substrait/plan_internal.h

    const substrait::Plan& plan,
-    const ExtensionIdRegistry* registry = default_extension_id_registry());
+    const ExtensionIdRegistry* registry = default_extension_id_registry(),
+    bool exclude_functions = false);


Can you add documentation for this parameter in the docstring above?

Also, as a nit, double negatives are not terrific, so I would instead suggest bool include_functions = true.

pitrou · 2022-07-11T15:07:30Z

cpp/src/arrow/engine/substrait/relation_internal.cc

+Result<std::vector<FieldRef>> FromProto(
+    const google::protobuf::RepeatedPtrField<substrait::Expression>& exprs,
+    const std::string& what) {
+  std::vector<FieldRef> fields;


May want to presize this?

pitrou · 2022-07-11T15:07:49Z

cpp/src/arrow/engine/substrait/relation_internal.cc

+  int size = exprs.size();
+  for (int i = 0; i < size; i++) {
+    ARROW_ASSIGN_OR_RAISE(FieldRef field, FromProto(exprs[i], what));
+    fields.push_back(field);


Suggested change

fields.push_back(field);

fields.push_back(std::move(field));

pitrou · 2022-07-11T15:42:14Z

python/pyarrow/_substrait.pyx

+from pyarrow._compute import make_function_registry
+
+
+def make_extension_id_registry():


Public APIs should get a docstring.

pitrou · 2022-07-11T15:44:15Z

python/pyarrow/_substrait.pyx

+    return execplan([], output_type, c_decls, True, c_func_registry)
+
+
+def run_query(plan, extid_registry, func_registry):


Should func_registry be optional?

Suggested change

def run_query(plan, extid_registry, func_registry):

def run_query(plan, extid_registry, func_registry=None):

Also, what about extid_registry? Are there cases where it could be omitted? @westonpace

pitrou · 2022-07-11T15:46:43Z

python/pyarrow/lib.pyx

 include "table.pxi"

+# Compute registries
+include "compute.pxi"


For the record, why is this addition necessary? Can Substrait instead directly import these declarations?

pitrou · 2022-07-11T15:47:20Z

python/pyarrow/tests/test_substrait.py

 import pyarrow as pa
 from pyarrow.lib import tobytes
 from pyarrow.lib import ArrowInvalid
+from pyarrow.substrait import make_extension_id_registry


It's not useful at all, is it? Can just call substrait.make_extension_id_registry?

pitrou · 2022-07-11T15:47:50Z

python/pyarrow/tests/test_substrait.py

-    reader = substrait.run_query(buf)
+    extid_registry = substrait.make_extension_id_registry()
+    func_registry = substrait.make_function_registry()
+    reader = substrait.run_query(buf, extid_registry, func_registry)


Can there also be a test with func_registry omitted or None?

rtpsw · 2022-07-11T20:28:15Z

I'm not sure I understand whether this PR is finished or not (I see some code commented out). If not, can you please make it draft?

I converted it to a draft. However, this is fairly mature code in the sense that I tested it locally. What mostly keeps this PR from being ready for review is pending changes elsewhere that it depends on. See this explanation post, including about the commented-out code.

rtpsw · 2022-11-27T11:01:51Z

For reference, the following issues were created as a follow-up:

github-actions · 2025-11-18T11:23:35Z

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

rtpsw added 25 commits March 11, 2022 10:10

Substrait integrations

716a5b9

Substrait integrations

8327b67

Merge branch 'rtpsw-x1' of https://github.com/rtpsw/arrow into rtpsw-x1

d054f2a

Added end-to-end Substrait-to-Arrow enhancements

330ae66

Added logical comparison operators to Substrait registry

abee905

Added as-of-merge execution

3f3f3ef

Added Substrait deserialization of flat field references for AsOfMerge

98d2663

Support write-consumer of Arrow Substrait plan

5aa7ede

Added explanation comment on MakeWriteNode

c0c0d08

Set use_threads on scan options of Arrow Substrait

f202dc5

try

a912ea5

merge rtpsw-x1 and fix

b8e56bc

Merge branch 'master' into rtpsw-x2

5b9025b

integrated and tested

dbacb0a

UDF PoC

f49a85d

merge master to rtpsw-x2

4eba11f

UDF PoC with scoped registries

5795a86

Fix parameter order and doc of DeserializePlan functions

90f20d0

Merge branch 'master' into rtpsw-x2

908862a

fix registry scoping

879999e

simple UDF benchmark

a15e0ca

improved UDF PoC benchmark

85bbaf4

add substrait tests

394676a

Merge branch 'master' into rtpsw-x2

11d59a2

ARROW-16968: [C++] Expand Python-UDF support to Arrow Substrait

2a7386e

github-actions bot added Component: C++ Component: Python labels Jul 3, 2022

rtpsw changed the title ~~Arrow 16968: [C++] Expand Python-UDF support to Arrow Substrait~~ ARROW 16968: [C++] Expand Python-UDF support to Arrow Substrait Jul 3, 2022

lint

1b1fdde

vibhatha reviewed Jul 4, 2022

westonpace self-requested a review July 5, 2022 16:18

pitrou reviewed Jul 11, 2022

rtpsw marked this pull request as draft July 11, 2022 20:24

westonpace removed their request for review September 2, 2022 19:41

github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025

github-actions bot closed this Dec 6, 2025
