-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15635: [C++] Support nested extension-id-registry #13214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
cc @icexelloss |
I think high level this makes sense as @westonpace previously mentioned in email discussion about an local/temporary registry for a specific substrait plan execution. However, I am not expert on this matter so curious to hear about how @westonpace and @vibhatha think. |
|
If this makes sense to other people, I could also try to make a similar separate solution for the function registry. |
|
@westonpace I wonder what's your thoughts about the changes here? Is this on the right track? |
|
@rtpsw Is the PR associated with the correct JIRA? |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the concept is a good idea. I think @sanjibansg is close to having a PR for ARROW-15582 which changes the extension id registry a bit. However, these two PRs are addressing orthogonal concerns so it should be pretty straightforward to rebase whichever one ends up being second.
What would you envision the lifetime of the custom extension id registry being? Based on some of the comments it sounds like you might be thinking per-plan.
For a plan-specific "embedded UDF" (i.e. a UDF that has been pickled and serialized as part of the plan) I though the idea was that the embedded UDF would have some special way of being inserted into an expression (admittedly, the protobuf is lacking for this part at the moment) and so it wouldn't need to be a part of the extension set.
For UDFs that are not embedded in the plan itself it seems the user would probably want to just register those once so I think the lifetime of this nested set would still be process-scoped.
The proposed change works fine for process-scoped but I want to make sure I understand the intended usage.
Lastly, we should have some unit tests in place before we merge this.
| }; | ||
| virtual util::optional<TypeRecord> GetType(const DataType&) const = 0; | ||
| virtual util::optional<TypeRecord> GetType(Id, bool is_variation) const = 0; | ||
| virtual Status CanRegisterType(Id, std::shared_ptr<DataType> type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| virtual Status CanRegisterType(Id, std::shared_ptr<DataType> type, | |
| virtual Status CanRegisterType(Id, const std::shared_ptr<DataType>& type, |
| virtual util::optional<FunctionRecord> GetFunction(Id) const = 0; | ||
| virtual util::optional<FunctionRecord> GetFunction( | ||
| util::string_view arrow_function_name) const = 0; | ||
| virtual Status CanRegisterFunction(Id, std::string arrow_function_name) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| virtual Status CanRegisterFunction(Id, std::string arrow_function_name) const = 0; | |
| virtual Status CanRegisterFunction(Id, const std::string& arrow_function_name) const = 0; |
|
|
||
| private: | ||
| ExtensionIdRegistry* registry_; | ||
| const ExtensionIdRegistry* registry_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any particular reason to change this to a const? I think it's fine but it seems unrelated to the change so I wanted to make sure I wasn't missing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is an extraction from a larger project I'm working on, and I just wanted the compiler to ensure that no unintended modification to the extension-id-registry occurs via this class. I don't mind so much keeping or removing the const modifier in this PR.
| ARROW_ENGINE_EXPORT Result<std::shared_ptr<Buffer>> SerializeJsonPlan( | ||
| const std::string& substrait_json); | ||
|
|
||
| ARROW_ENGINE_EXPORT std::shared_ptr<ExtensionIdRegistry> MakeExtensionIdRegistry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably start documenting the functions in this file. At least a brief reference to the nice docs you have in extension_set.h
Yeah, this is by mistake. I'll move it soon. |
Either or both per-process and per-plan. The scope can be controlled by the user (from Arrow or PyArrow):
Locally, I was able to get this to work for UDFs restricted to element-wise flat-scalar-valued (not analytic or reduction, not struct, not table or dataset).
My local solution involves an Ibis/Substrait/Arrow workflow, where all UDFs are registered once and get serialized into each Substrait plan they are used in.
As explained above, I think it works more broadly.
Agreed. I'll move on to add them, now that this approach seems to be acceptable. |
|
This PR is replaced by #13232 |
Replacing #13214 Lead-authored-by: Yaron Gvili <[email protected]> Co-authored-by: rtpsw <[email protected]> Signed-off-by: Weston Pace <[email protected]>
See this post.