Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Apr 26, 2021

Moves Expression and its test and benchmark into the compute/exec/ directory. I haven't introduced an exec namespace.

@bkietz bkietz requested a review from lidavidm April 26, 2021 19:31
@github-actions
Copy link

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One nit about something that was recently introduced (I assume it just got lost in the rebasing).

I'll hold off on merging datasets stuff until this is through so we're not stuck in an endless rebase cycle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll follow up and adjust the line numbers in the corresponding reST file (and see if I can figure out a better way to excerpt code snippets than hardcoding line numbers).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm
Copy link
Member

Ah, looks like clang-format/cmake-format are unhappy, and the CMake magic that makes the new header gets installed is missing.

@nealrichardson
Copy link
Member

This just moves the code? There's no change in how or where expressions are evaluated yet, correct? I.e. they're still only for datasets here, they're just in a different namespace?

@bkietz
Copy link
Member Author

bkietz commented Apr 27, 2021

TODO: update bindings

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't notice this got rebased yesterday (feel free to ping me). This looks good, just needs to be linted/formatted, and it looks like compute/expression_benchmark.cc needs to be fixed. The AppVeyor failure is unrelated.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R changes LGTM

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@lidavidm lidavidm closed this in 7430bbd Apr 30, 2021
@ianmcook
Copy link
Member

ianmcook commented Apr 30, 2021

@bkietz This caused a failure in the test-r-minimal-build nightly test (that's the test where we build the C++ library and the R package with Dataset and Parquet switched off):
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4423&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=952
Do you know off the top of your head how to resolve this? Thanks

@lidavidm
Copy link
Member

@ianmcook I think we just need to add Expression to compute/type_fwd.h. I'll do this now.

@bkietz bkietz deleted the 11929-Promote-Expression-to-the branch April 30, 2021 14:39
nealrichardson added a commit that referenced this pull request May 13, 2021
Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants