ARROW-11929: [C++][Dataset][Compute] Promote expression to the compute namespace #10166

bkietz · 2021-04-26T19:31:35Z

Moves Expression and its test and benchmark into the compute/exec/ directory. I haven't introduced an exec namespace.

github-actions · 2021-04-26T19:31:55Z

https://issues.apache.org/jira/browse/ARROW-11929

lidavidm

LGTM. One nit about something that was recently introduced (I assume it just got lost in the rebasing).

I'll hold off on merging datasets stuff until this is through so we're not stuck in an endless rebase cycle.

lidavidm · 2021-04-26T19:33:56Z

cpp/examples/arrow/dataset_documentation_example.cc

I'll follow up and adjust the line numbers in the corresponding reST file (and see if I can figure out a better way to excerpt code snippets than hardcoding line numbers).

https://issues.apache.org/jira/browse/ARROW-12605

cpp/src/arrow/dataset/file_parquet_test.cc

lidavidm · 2021-04-26T19:46:49Z

Ah, looks like clang-format/cmake-format are unhappy, and the CMake magic that gets the new header installed is missing.

nealrichardson · 2021-04-27T16:14:45Z

This just moves the code? There's no change in how or where expressions are evaluated yet, correct? I.e. they're still only for datasets here, they're just in a different namespace?

bkietz · 2021-04-27T17:36:41Z

TODO: update bindings

lidavidm

Sorry, I didn't notice this got rebased yesterday (feel free to ping me). This looks good, just needs to be linted/formatted, and it looks like compute/expression_benchmark.cc needs to be fixed. The AppVeyor failure is unrelated.

nealrichardson

R changes LGTM

lidavidm

LGTM, thanks!

ianmcook · 2021-04-30T13:27:33Z

@bkietz This caused a failure in the test-r-minimal-build nightly test (that's the test where we build the C++ library and the R package with Dataset and Parquet switched off):
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4423&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=952
Do you know off the top of your head how to resolve this? Thanks

lidavidm · 2021-04-30T13:28:51Z

@ianmcook I think we just need to add Expression to compute/type_fwd.h. I'll do this now.
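For readers unfamiliar with the pattern, here is a minimal sketch of what a forward declaration in a `type_fwd.h`-style header buys you. It is not the actual Arrow code: the `compute::Expression` class below is a toy stand-in, and `Describe`, `ToString`, and `repr_` are hypothetical names chosen only for illustration. The point is that code which merely names the type (by reference or pointer) can compile without the full header, which is presumably what a build with Dataset switched off needs.

```cpp
#include <cassert>  // only needed for the usage check mentioned below
#include <string>
#include <utility>

// Hypothetical stand-in for a type_fwd.h-style header: it names the class
// without defining it.
namespace compute {
class Expression;
}  // namespace compute

// A declaration that only mentions the type by reference compiles against the
// forward declaration alone; no full definition is required at this point.
std::string Describe(const compute::Expression& expr);

// Hypothetical stand-in for the full header, included only where the class's
// members are actually used.
namespace compute {
class Expression {
 public:
  explicit Expression(std::string repr) : repr_(std::move(repr)) {}
  const std::string& ToString() const { return repr_; }

 private:
  std::string repr_;
};
}  // namespace compute

// The definition needs the complete type because it calls a member function.
std::string Describe(const compute::Expression& expr) {
  return "expr: " + expr.ToString();
}
```

As a usage check, `assert(Describe(compute::Expression("x > 1")) == "expr: x > 1");` should hold once the full definition is visible.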

@bkietz

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>

bkietz requested a review from lidavidm April 26, 2021 19:31

github-actions bot added the Component: C++ label Apr 26, 2021

lidavidm approved these changes Apr 26, 2021

bkietz added 3 commits April 28, 2021 11:03

ARROW-11929: [C++][Dataset][Compute] Move Expression into compute::

3b70923

namespace fixing

472b90f

repair bindings

874b9fe

bkietz force-pushed the 11929-Promote-Expression-to-the branch from 5ecf4e2 to 874b9fe Compare April 28, 2021 20:38

github-actions bot added Component: Python Component: R labels Apr 28, 2021

nealrichardson mentioned this pull request Apr 28, 2021

ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code #10191

Closed

lidavidm reviewed Apr 29, 2021

nealrichardson approved these changes Apr 29, 2021

bkietz added 3 commits April 29, 2021 15:44

lint, format

51c45d6

fix benchmarks

2d03c5a

remove redundant SetFilter

e550100

lidavidm approved these changes Apr 29, 2021

lidavidm closed this in 7430bbd Apr 30, 2021

bkietz deleted the 11929-Promote-Expression-to-the branch April 30, 2021 14:39

asfimport mentioned this pull request Apr 30, 2021

[C++][Compute] Promote Expression to the compute namespace #27766

Closed

asfimport mentioned this pull request May 4, 2021

[C++] Fix errors from VS 2019 in cpp/src/parquet/types.h #28392

Closed
