-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1412: [Plasma] Add higher level API for putting and getting Python objects #995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0886ac0 to
44c3b3d
Compare
python/pyarrow/plasma.pyx
Outdated
| release_delay, num_retries)) | ||
| return result | ||
|
|
||
| def put(PlasmaClient client, value, object_id=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to specify types here for value like list value (since that's the only thing supported for now), and ObjectID object_id=None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason not to make these methods on the PlasmaClient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why these are not methods on the PlasmaClient is that there is already a method get on there, which gets the underlying Buffers. If that's preferable we can rename that method to get_buffers and have get and put be methods of the PlasmaClient. What do you think?
python/pyarrow/plasma.pyx
Outdated
| client.seal(id) | ||
| return id | ||
|
|
||
| def get(PlasmaClient client, object_ids, timeout_ms=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout_ms should be given a type of int
python/pyarrow/plasma.pyx
Outdated
| ------- | ||
| The object ID associated to the Python object. | ||
| """ | ||
| cdef ObjectID id = object_id if object_id else ObjectID.from_random() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally try to avoid using id as a variable name since it is also a built in python function
python/pyarrow/plasma.pyx
Outdated
| release_delay, num_retries)) | ||
| return result | ||
|
|
||
| def put(PlasmaClient client, list value, ObjectID object_id=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of forcing the user to pass in a list, why not just allow them to pass in anything and then wrap it in a list?
|
@wesm What do you think about the new API? An alternative that we might want to consider is to leave the PlasmaClient API as it is and add the new API as A potential issue of the current one is that users might try to mix high-level and low-level APIs in inappropriate ways (like call client.create and then client.get on the object_id). I think the Windows build error is unrelated. |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this looks very nice to me as is. I'm not sure what's up with the Appveyor build failure but I opened ARROW-1420 to figure it out
Note that this PR breaks the PlasmaClient API (which is still unstable at this point, so this is acceptable). It renames PlasmaClient.get to PlasmaClient.get_buffers and introduces two new functions, PlasmaClient.put and PlasmaClient.get which can put Python objects into the object store and provide access to their content. The old get was renamed to get_buffers because most users will want to use the new get method and therefore it should have the more concise name.
There is some freedom in designing the API; I tried to make it so there is a unified API between getting one and multiple objects (the latter is supported to limit the number of IPC roundtrips with the plasma store when we get many small objects). I also introduced a special object that is returned if one of the objects was not available within the timeout. We could use "None" here, but then it would be hard to distinguish between getting a "None" object and a timeout.