@pcmoritz
Contributor

Thanks a lot to @crystalzyan who did all the heavy lifting for this PR!

Contributor Author

@pcmoritz pcmoritz Jul 24, 2017

Ideally we would use get_record_batch_size here, but it doesn't account for the metadata.

Member

@wesm wesm left a comment

Will have to look through the rest in more detail later

Member

Seems like we might merge these with the more general build instructions? Though I suspect that most users will obtain the plasma client via pip or conda packages

@crystalzyan

Hey Philipp,

Five things came up when I was reviewing plasma.rst (the plasma python tutorial):

Mac OS X Installation Instructions Don't Work For Me

I still don't know what's wrong with my Mac, but the installation instructions for Plasma still don't work for me. I get this error when trying to import plasma:

[Screenshot: pyarrow plasma import error]

I've tried the installation instructions with both your pcmoritz/arrow plasma-cython branch and the actual apache/arrow repo, and I updated my dependency packages, but this still happens. Did something go wrong in the install pyarrow + plasma step?

Also, there's a paragraph in the Mac OS X Installation Instructions I had left which goes something like:

Plasma also requires the build-essential, curl, unzip, libboost-all-dev, and libjemalloc-dev packages. MacOS should already come with curl, unzip, and the compilation tools found in build-essential.

This was honestly more of a note to myself than anything, so I'm not sure if this paragraph is still necessary (it might confuse users).

Does PlasmaClient.get Take in only Lists?

I noticed that in the tutorial, all calls made to PlasmaClient.get seem to require list brackets for the argument and return result:

[buffer2] = client2.get([object_id])

It's a little unexpected, but this is how the method's syntax behaves, correct? If so, we could add a sentence in the Getting an Object section saying that the PlasmaClient.get method only takes in and returns lists (in contrast to ray.get).

Also, if it's like ray.get in that it can get multiple object ids at once, we might want to include a code example of that capability:

Note that client.get takes the single argument object_id in a list and returns the single plasma object in a list. This is because client.get also supports getting multiple Object IDs at once. To get multiple objects, you similarly pass the Object IDs in as a list and receive the objects back as a list, as follows:

[buffer_A, buffer_B] = client2.get([object_id_A, object_id_B])
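
To make the list-in, list-out shape concrete without needing a running store, here is a minimal sketch using a toy in-memory class. `ListGetClient` and its `put` method are hypothetical stand-ins written for this illustration; they are not part of the Plasma API, and only the calling convention of `get` is meant to resemble what the tutorial describes.

```python
# Hypothetical stand-in for illustration only; this is NOT the Plasma client.
class ListGetClient:
    """Toy in-memory store whose get() mirrors the list-in, list-out shape."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, payload):
        # Store a payload under the given ID (illustrative helper).
        self._objects[object_id] = payload

    def get(self, object_ids):
        # Always takes a list of IDs and returns a list in the same order.
        return [self._objects[object_id] for object_id in object_ids]


client = ListGetClient()
client.put(b"id-A", b"payload A")
client.put(b"id-B", b"payload B")

# Single object: one-element list in, one-element list out.
[buffer_a] = client.get([b"id-A"])

# Several objects: results come back in the order the IDs were passed.
[buffer_b, buffer_a_again] = client.get([b"id-B", b"id-A"])
```

The bracketed left-hand side is ordinary Python sequence unpacking, so the number of names on the left has to match the number of IDs passed in; a mismatch raises `ValueError`.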

Reword the Timeout Explanation

Under the Getting an Object section, you included a mention of the timeout_ms argument for the PlasmaClient.get function:

If the object has not been sealed yet, then the call to client.get will block until the object has been sealed by the client constructing the object. Using the timeout_ms argument to get, you can specify a timeout for this (in milliseconds). After the timeout, the interpreter will yield control back.

These last two sentences are very brief and do not show a code example of passing the timeout_ms argument. I would suggest adding one, as a comparison to a normal call to PlasmaClient.get.

Also, I found these two sentences confusing at first: the word get wasn't highlighted as code or written out in full, so I read it as part of the grammar of the sentence rather than as a reference to the Python function PlasmaClient.get.

We could instead do something like:

If the object has not been sealed yet, then the call to client.get will block until the object has been sealed by the client constructing the object. However, we can limit how long client.get can block by passing in an optional timeout_ms argument.

By setting timeout_ms, we specify a timeout for this function call (in milliseconds). This timeout will force the interpreter to exit client.get early (regardless of success) if the function takes longer than timeout_ms milliseconds. Here is an example of using timeout_ms with client.get:

[buffer2] = client2.get([object_id], timeout_ms=100)  # This call will time out after 100 ms
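
The blocking-until-sealed behaviour and the effect of a timeout can be sketched with the standard library alone. `SealableObject` below is a hypothetical stand-in, not the Plasma client; two details are assumptions of this sketch rather than anything confirmed in the thread: a negative `timeout_ms` means "wait indefinitely", and a timed-out `get` is reported as `None`.

```python
import threading
import time


class SealableObject:
    """Hypothetical stand-in for an object that becomes readable once sealed."""

    def __init__(self):
        self._sealed = threading.Event()
        self._payload = None

    def seal(self, payload):
        # Publish the payload and wake up any blocked readers.
        self._payload = payload
        self._sealed.set()

    def get(self, timeout_ms=-1):
        # Assumption of this sketch: negative timeout blocks until sealed.
        timeout_s = None if timeout_ms < 0 else timeout_ms / 1000.0
        if self._sealed.wait(timeout_s):
            return self._payload
        # Assumption of this sketch: a timed-out get is reported as None.
        return None


obj = SealableObject()

# Nothing has sealed the object yet, so this returns after roughly 100 ms.
start = time.monotonic()
first = obj.get(timeout_ms=100)
elapsed_ms = (time.monotonic() - start) * 1000.0

# A producer seals the object 50 ms from now; the untimed get blocks until then.
producer = threading.Timer(0.05, obj.seal, args=(b"sealed payload",))
producer.start()
second = obj.get()
producer.join()
```

Whatever the real client returns on timeout, the point for the tutorial is the same: the caller has to check the result of a timed `get` before using it.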

Pandas Reference Link Broken

In the Storing Pandas DataFrames in Plasma section, I had originally included an rst link to the Using PyArrow with Pandas page of the arrow documentation. This was to let users know that they could check out the conversion charts between pandas and Arrow:

One can instead use pyarrow and its supportive API as an intermediary step to import the Pandas DataFrame into Plasma. Arrow has multiple equivalent types to the various Pandas structures, see the :ref:pandas page for more.

However, this :ref: link is currently broken, since the corresponding link anchor I put in pandas.rst has been removed. We should remove this broken link reference entirely, then.

Include One-Liners for Converting Plasma Objects Back to Arrow/Pandas

This is just an idea for the sake of convenience. After we explain to users the conversion steps for PlasmaBuffer -> Arrow reader -> Arrow tensor -> numpy array in Getting Arrow Objects from Plasma (and similarly PlasmaBuffer -> Arrow BufferReader -> Arrow RecordBatchStreamReader -> Arrow RecordBatch -> Pandas DataFrame in Getting Pandas DataFrames from Plasma), we could also provide an equivalent condensed one-liner for the code example. This would show that the conversion steps aren't really that intimidating or difficult:

For Arrow:

We can condense the entire procedure described above into one-liners as follows:

# Get the arrow object by ObjectID.
[buf2] = client.get([object_id])

# Equivalent one-liner to convert Plasma buffer back to Arrow tensor
tensor2 = pa.read_tensor(pa.BufferReader(buf2))

# Equivalent one-liner to convert Plasma buffer back to numpy array
array = pa.read_tensor(pa.BufferReader(buf2)).to_numpy()

For Pandas:

The above conversion procedures may seem lengthy, but we can put them all together into a one-liner as follows:

# Fetch the Plasma object
[data] = client.get([object_id])

# Equivalent one-liner to convert Plasma buffer back to Pandas DataFrame
result = pa.RecordBatchStreamReader(pa.BufferReader(data)).read_next_batch().to_pandas()

crystalzyan and others added 16 commits July 31, 2017 20:39
…ontents header at top, minor tweaks to Linux Installation section. Still need to do Installation on Mac OS and storing Arrow/Panda in Plasma
…t to 'Getting an Object' subsection in Plasma API.
…for Starting the Object Store, Creating Clients, Creating Objects, Getting Objects, Transferring to Remote Stores, Querying Status, Releasing Objects, and Shutting Down Clients and Stores. Basically all of the PlasmaClient API. Warning- I could not get C++ running on my machine to verify that any of the code runs properly/works. Please verify all code and tutorial content
@robertnishihara
Contributor

I pushed a few small changes.

This looks good to me, nice job @crystalzyan :)

@pcmoritz pcmoritz changed the title from "[WIP] ARROW-1257: Plasma documentation" to "ARROW-1257: Plasma documentation" Aug 1, 2017
the Plasma store in this case, issue the command below:

```shell
killall plasma_store &
```

Member

Why is killall sent to the background?

Alternatively, you can run the Plasma store in the background and ignore all
message output with the following terminal command:

````
```shell
````
Member

Does using these annotations work in your doxygen version?

Contributor

The shell ones aren't obviously doing anything (want me to remove them?), but the cpp ones definitely help. Using doxygen 1.8.13.

I can remove the cpp ones also if you prefer.

Member

@wesm wesm Aug 1, 2017

I just checked this out locally with doxygen 1.8.13. The rendered output looks OK, but it doesn't seem like shell is supported by the rendering engine (I tried tracking down where Doxygen's support for GH-flavored markdown is coming from but couldn't find anything conclusive).

The C++ looks good though so definitely leave that =)

Contributor

Ok, I removed the shell keyword.

Member

@wesm wesm left a comment

+1, very nice. thanks all!

@asfgit asfgit closed this in 7e7861c Aug 2, 2017
@robertnishihara robertnishihara deleted the plasma-docs branch August 2, 2017 06:16
5 participants