Conversation

@rtpsw
Contributor

@rtpsw rtpsw commented May 22, 2022

See this post.

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@rtpsw
Contributor Author

rtpsw commented May 22, 2022

cc @icexelloss

@icexelloss
Contributor

> registering Python UDFs to an extension registry instance that (1) is specific to the Python interpreter and (2) is linked to the default global one (so it can find both UDF and normal functions). This Python-specific registry would then be passed to be used by the execution engine. I think this way (only) the Python-specific registry would naturally get cleaned up on finalization of the Python interpreter.

At a high level, I think this makes sense; @westonpace previously mentioned, in an email discussion, a local/temporary registry for a specific Substrait plan execution. However, I am not an expert on this matter, so I am curious to hear what @westonpace and @vibhatha think.

@rtpsw
Contributor Author

rtpsw commented May 23, 2022

If this makes sense to other people, I could also try to make a similar separate solution for the function registry.

@icexelloss
Contributor

@westonpace What are your thoughts on the changes here? Is this on the right track?

@vibhatha
Collaborator

@rtpsw Is the PR associated with the correct JIRA?

Member

@westonpace westonpace left a comment

I think the concept is a good idea. I think @sanjibansg is close to having a PR for ARROW-15582 which changes the extension id registry a bit. However, these two PRs are addressing orthogonal concerns so it should be pretty straightforward to rebase whichever one ends up being second.

What would you envision the lifetime of the custom extension id registry being? Based on some of the comments it sounds like you might be thinking per-plan.

For a plan-specific "embedded UDF" (i.e. a UDF that has been pickled and serialized as part of the plan) I thought the idea was that the embedded UDF would have some special way of being inserted into an expression (admittedly, the protobuf is lacking for this part at the moment) and so it wouldn't need to be a part of the extension set.

For UDFs that are not embedded in the plan itself, it seems the user would probably want to register them just once, so I think the lifetime of this nested set would still be process-scoped.

The proposed change works fine for process-scoped but I want to make sure I understand the intended usage.

Lastly, we should have some unit tests in place before we merge this.

};
virtual util::optional<TypeRecord> GetType(const DataType&) const = 0;
virtual util::optional<TypeRecord> GetType(Id, bool is_variation) const = 0;
virtual Status CanRegisterType(Id, std::shared_ptr<DataType> type,
Member

Suggested change
- virtual Status CanRegisterType(Id, std::shared_ptr<DataType> type,
+ virtual Status CanRegisterType(Id, const std::shared_ptr<DataType>& type,

virtual util::optional<FunctionRecord> GetFunction(Id) const = 0;
virtual util::optional<FunctionRecord> GetFunction(
util::string_view arrow_function_name) const = 0;
virtual Status CanRegisterFunction(Id, std::string arrow_function_name) const = 0;
Member

Suggested change
- virtual Status CanRegisterFunction(Id, std::string arrow_function_name) const = 0;
+ virtual Status CanRegisterFunction(Id, const std::string& arrow_function_name) const = 0;
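
Both suggested changes above replace a by-value parameter with a const reference. As a minimal sketch of what that avoids, assuming a hypothetical `DataTypeSketch` stand-in (not Arrow's `DataType`) and two illustrative helper functions, the by-value form copies the `std::shared_ptr` and bumps its reference count for the duration of the call, while the const-reference form does not:

```cpp
#include <memory>

// Hypothetical stand-in for arrow::DataType; used only for illustration.
struct DataTypeSketch {};

// By value: the shared_ptr is copied into the parameter, so the reference
// count observed inside the function is one higher than the caller's.
long UseCountByValue(std::shared_ptr<DataTypeSketch> type) {
  return type.use_count();
}

// By const reference: no copy is made, so the reference count is unchanged.
long UseCountByConstRef(const std::shared_ptr<DataTypeSketch>& type) {
  return type.use_count();
}
```

For a function that only inspects its argument, as the `CanRegister...` checks appear to, the const-reference form skips the atomic reference-count update (or, for `std::string`, a possible allocation and copy).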


private:
- ExtensionIdRegistry* registry_;
+ const ExtensionIdRegistry* registry_;
Member

Is there any particular reason to change this to a const? I think it's fine but it seems unrelated to the change so I wanted to make sure I wasn't missing anything.

Contributor Author

This PR is extracted from a larger project I'm working on, and I just wanted the compiler to ensure that no unintended modification to the extension-id-registry occurs via this class. I don't mind either keeping or removing the const modifier in this PR.
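
The compiler check described here can be illustrated with a small sketch. The class names `RegistrySketch` and `ExtensionSetSketch` are hypothetical stand-ins, not the Arrow classes; the point is only that a `const T*` member permits calls to const member functions and rejects mutating ones at compile time:

```cpp
// Hypothetical stand-in for the registry; not the Arrow class.
class RegistrySketch {
 public:
  int size() const { return size_; }  // read-only query
  void Register() { ++size_; }        // mutating operation

 private:
  int size_ = 0;
};

// Hypothetical stand-in for a class that holds a pointer to the registry.
class ExtensionSetSketch {
 public:
  explicit ExtensionSetSketch(const RegistrySketch* registry)
      : registry_(registry) {}

  // Allowed: size() is a const member function.
  int RegistrySize() const { return registry_->size(); }

  // Not allowed: uncommenting the body below fails to compile, because
  // Register() is non-const and registry_ points to a const object.
  // void Mutate() { registry_->Register(); }

 private:
  const RegistrySketch* registry_;
};
```

The owner of the registry can still modify it directly; the const pointer only rules out modification through the holding class.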

ARROW_ENGINE_EXPORT Result<std::shared_ptr<Buffer>> SerializeJsonPlan(
const std::string& substrait_json);

ARROW_ENGINE_EXPORT std::shared_ptr<ExtensionIdRegistry> MakeExtensionIdRegistry();
Member

We should probably start documenting the functions in this file, at least with a brief reference to the nice docs you have in extension_set.h.

@rtpsw
Contributor Author

rtpsw commented May 25, 2022

> @rtpsw Is the PR associated with the correct JIRA?

Yes, that was a mistake. I'll move it soon.

@rtpsw
Contributor Author

rtpsw commented May 25, 2022

> What would you envision the lifetime of the custom extension id registry being? Based on some of the comments it sounds like you might be thinking per-plan.

Per-process, per-plan, or both. The scope can be controlled by the user (from Arrow or PyArrow):

  • For per-process scope, the user creates per_process_registry = nested_extension_id_registry(default_extension_id_registry()) and keeps it for the lifetime of the process.
  • For per-plan scope, the user recreates per_plan_registry = nested_extension_id_registry(default_extension_id_registry()) for each plan.
  • For both per-process and per-plan scope, the user creates per_process_registry = nested_extension_id_registry(default_extension_id_registry()) and recreates per_plan_registry = nested_extension_id_registry(per_process_registry) (note the double nesting) for each plan.
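
The three scopes above could be sketched roughly as follows. This is a minimal illustration rather than Arrow's API: `SketchRegistry` and `MakeNested` are hypothetical names standing in for the registry and for `nested_extension_id_registry(...)`, and the rule that an id already visible through the parent chain cannot be registered again is an assumption made for the sketch.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an extension-id registry; NOT the Arrow class.
// It maps a function id to an Arrow function name and may have a parent.
class SketchRegistry {
 public:
  explicit SketchRegistry(std::shared_ptr<const SketchRegistry> parent = nullptr)
      : parent_(std::move(parent)) {}

  // Look in this registry first, then fall back to the parent chain.
  std::optional<std::string> GetFunction(const std::string& id) const {
    auto it = functions_.find(id);
    if (it != functions_.end()) return it->second;
    if (parent_ != nullptr) return parent_->GetFunction(id);
    return std::nullopt;
  }

  // Register in this registry only. Assumption: ids already visible through
  // the parent chain are rejected. The parent is never modified.
  bool RegisterFunction(const std::string& id, const std::string& arrow_name) {
    if (GetFunction(id).has_value()) return false;
    functions_.emplace(id, arrow_name);
    return true;
  }

 private:
  std::shared_ptr<const SketchRegistry> parent_;
  std::map<std::string, std::string> functions_;
};

// Hypothetical helper mirroring nested_extension_id_registry(parent).
std::shared_ptr<SketchRegistry> MakeNested(
    std::shared_ptr<const SketchRegistry> parent) {
  return std::make_shared<SketchRegistry>(std::move(parent));
}
```

With this shape, a per-plan registry created by `MakeNested(per_process_registry)` sees its own entries, the per-process entries, and the default ones, and everything registered in it disappears when it is destroyed, leaving the outer registries untouched.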

> For a plan-specific "embedded UDF" (i.e. a UDF that has been pickled and serialized as part of the plan) I thought the idea was that the embedded UDF would have some special way of being inserted into an expression (admittedly, the protobuf is lacking for this part at the moment) and so it wouldn't need to be a part of the extension set.

Locally, I was able to get this to work for UDFs restricted to element-wise, flat-scalar-valued functions (not analytic or reduction, not struct, not table or dataset).

> For UDFs that are not embedded in the plan itself, it seems the user would probably want to register them just once, so I think the lifetime of this nested set would still be process-scoped.

My local solution involves an Ibis/Substrait/Arrow workflow, where all UDFs are registered once and get serialized into each Substrait plan they are used in.

> The proposed change works fine for process-scoped but I want to make sure I understand the intended usage.

As explained above, I think it works more broadly.

> Lastly, we should have some unit tests in place before we merge this.

Agreed. I'll move on to add them, now that this approach seems to be acceptable.

@rtpsw
Contributor Author

rtpsw commented May 25, 2022

This PR is replaced by #13232

@westonpace westonpace closed this May 26, 2022
@rtpsw rtpsw deleted the ARROW-15635 branch June 1, 2022 09:42
westonpace pushed a commit that referenced this pull request Jun 2, 2022
Replacing #13214

Lead-authored-by: Yaron Gvili
Co-authored-by: rtpsw
Signed-off-by: Weston Pace
@asfimport asfimport mentioned this pull request Oct 25, 2022
14 tasks