ARROW-16776: [R] dplyr::glimpse method for arrow table and datasets #13563

nealrichardson · 2022-07-10T18:56:37Z

See reprex (sans terminal formatting) in r/tests/testthat/_snaps/dplyr-glimpse.md

Not all queries can be glimpse()d: some would require evaluating the whole query, which may be expensive (and can't be interrupted yet, see ARROW-11841).

Note that the existing print() methods aren't affected by this. There is still the idea that the print methods for Table/RecordBatch should print some data (ARROW-16777 and others), but that should probably be column-oriented instead of row-oriented like glimpse().

github-actions · 2022-07-10T18:56:56Z

https://issues.apache.org/jira/browse/ARROW-16776

wjones127 · 2022-07-11T03:33:33Z

r/R/dplyr-count.R

 #' @importFrom rlang sym :=
 tally.arrow_dplyr_query <- function(x, wt = NULL, sort = FALSE, name = NULL) {
-  check_name <- utils::getFromNamespace("check_name", "dplyr")
+  check_name <- getFromNamespace("check_name", "dplyr")


Question: Is there a reason we didn't want the utils::?

We've imported it elsewhere with importFrom, so the utils:: prefix isn't necessary.
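For readers less familiar with the pattern under discussion, here is a minimal sketch of what the import buys. The roxygen tag appears only as a comment because it takes effect inside a package build; `head` is a stand-in lookup target chosen for this example and is not something the PR touches.

```r
# In package code, a roxygen2 tag like the following adds
# importFrom(utils, getFromNamespace) to the generated NAMESPACE file:
#   #' @importFrom utils getFromNamespace
# With that import in place, the unqualified name resolves inside the package.

# In an interactive session or under Rscript, utils is attached by default,
# so both spellings find the same function here as well.
qualified <- utils::getFromNamespace("head", "utils")
unqualified <- getFromNamespace("head", "utils")

# Both lookups return the very same function object.
same_function <- identical(qualified, unqualified)
print(same_function)
```

Either spelling works; the diff above simply drops the prefix because the package already imports the function.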

nealrichardson · 2022-07-11T13:53:58Z

@jthomasmock FYI

jthomasmock · 2022-07-11T15:16:37Z

> @jthomasmock FYI

Very excited to test this out once it's in dev; I had peeked through the testthat outputs/snapshots!

wjones127

This looks great. I tried throwing something a little more complex at it, and it formatted the output well:

library(arrow)
library(dplyr)

tab <- arrow_table(
    x = Array$create(c(1, 2, 3)),
    extremely_long_name_of_a_column_here = Array$create(list(
        list(data.frame(x = rep("XXXXXXXXXXXXXXXX", 100))),
        list(data.frame(x = rep("YYYYYY", 100))),
        list(data.frame(x = rep("ZZZZZZZZZ", 100)))
    ))
)

glimpse(tab)
#> Table
#> 3 rows x 2 columns
#> $ x                                       <double> 1, 2, 3
#> $ extremely_long_name_of_a_column_here <list<...>> [[<tbl_df[100 x 1]>]], [[<tbl…
#> Call `print()` for full schema details

Created on 2022-07-11 by the reprex package (v2.0.1)

r/R/dplyr-glimpse.R

wjones127 · 2022-07-11T15:33:57Z

r/R/dplyr.R

-has_aggregation <- function(x) {
-  # TODO: update with joins (check right side data too)
-  !is.null(x$aggregations) || (is_collapsed(x) && has_aggregation(x$.data))
+query_can_stream <- function(x) {


The reason we can't push this down to C++ is that we haven't constructed an exec plan yet, right? Otherwise, it would be more maintainable to do so.

I don't follow. We could build an ExecPlan, but it wouldn't tell us anything about how it would perform, would it? I'm trying to detect cases where I can just take head() of the data without having to scan an entire dataset.

> We could build an ExecPlan, but it wouldn't tell us anything about how it would perform, would it?

I'm not super close to the ExecPlan code, but I thought they were composed of a graph of nodes that could be traversed and analyzed, just like our arrow_dplyr_query structure. Am I wrong on that?

> I'm trying to detect cases where I can just take head() of the data without having to scan an entire dataset.

I was just thinking that having such a method on ExecPlan would be useful in general.

Sure, that probably would be useful

> I was just thinking that having such a method on ExecPlan would be useful in general.

Possibly. We'd probably want to define it more formally. SQL has LIMIT X, and Substrait's equivalent is FetchRel. Neither of these is exactly what is being detected here. For example, it is legal to write SELECT SUM(x) FROM table LIMIT 1, but the limit wouldn't actually reduce the data being read.

We could define it as "single pipeline queries" but a pipeline breaker doesn't necessarily mean a query is non-streaming (for example, hash-join is sometimes permitted as "streaming" in this example but it is always a pipeline breaker).
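The LIMIT point can be illustrated with a small base R analogy. This is only a sketch: the data frame `tbl` and its values are made up for illustration, and no Arrow or SQL engine is involved. The aggregate has to consume every row before the limit applies, so the limit trims the one-row result rather than the input.

```r
tbl <- data.frame(x = c(1, 2, 3, 4))

# Rough analogue of SELECT SUM(x) FROM tbl LIMIT 1:
# the aggregation runs over all rows first ...
aggregated <- data.frame(sum_x = sum(tbl$x))
# ... and only then is the limit applied, to a result that already has one row.
limited <- head(aggregated, 1)

# Contrast: a plain projection with a limit only needs the first row.
first_row <- head(tbl, 1)
```

In the first case every value of `x` contributes to the answer, which is why the presence of a limit alone does not make a query cheap to preview.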

Since you mentioned limit, I'll make a plug for ARROW-16628. It's not relevant to this particular question; it would just let me delete some R-specific handling outside of the ExecPlan, and I'm guessing we'll have to do it to support Substrait.
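To make the idea in this thread concrete, here is a toy sketch of walking a nested query to decide whether taking head() is safe. It mirrors the recursive shape of the removed has_aggregation() in the diff above, but everything else is an assumption: `toy_query`, `has_aggregation_sketch`, and `can_take_head_sketch` are hypothetical names, the real `arrow_dplyr_query` has more fields, and the actual `query_can_stream()` body is not shown in this conversation and presumably checks more than aggregations.

```r
# Hypothetical stand-in for a query object: an optional aggregation plus
# a `.data` slot that holds either raw data or another (nested) query.
toy_query <- function(.data = NULL, aggregations = NULL) {
  structure(
    list(.data = .data, aggregations = aggregations),
    class = "toy_query"
  )
}

# Recursively check this query and every query nested beneath it.
has_aggregation_sketch <- function(x) {
  if (!inherits(x, "toy_query")) {
    return(FALSE) # reached raw data (or NULL): nothing left to inspect
  }
  !is.null(x$aggregations) || has_aggregation_sketch(x$.data)
}

# In this simplified model, head() is safe only if no level aggregates.
can_take_head_sketch <- function(x) !has_aggregation_sketch(x)

base_query <- toy_query(.data = data.frame(x = 1:3))
agg_query <- toy_query(.data = base_query, aggregations = list(total = "sum(x)"))
outer_query <- toy_query(.data = agg_query) # e.g. a filter applied after summarising
```

Here `can_take_head_sketch(base_query)` is TRUE, while both `agg_query` and `outer_query` return FALSE because an aggregation sits somewhere in the chain.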



github-actions bot added the Component: Documentation and Component: R labels Jul 10, 2022

nealrichardson added 3 commits July 10, 2022 15:31

Make query_on_dataset() consider all sources

e5d9e71

dplyr::glimpse() and a few assorted cleanups

5537e2f

Better determine when it is safe to glimpse

2382b88

nealrichardson force-pushed the glimpse-part1 branch from fe9867e to 2382b88 July 10, 2022 20:16

wjones127 reviewed Jul 11, 2022


Cleanup TODOs and add more comments

c8971a8

nealrichardson marked this pull request as ready for review July 11, 2022 13:47

wjones127 approved these changes Jul 11, 2022


Update r/R/dplyr-glimpse.R

c94f61f

Co-authored-by: Will Jones <[email protected]>

nealrichardson commented Jul 11, 2022

r/tests/testthat/_snaps/dplyr-glimpse.md (outdated)

Update r/tests/testthat/_snaps/dplyr-glimpse.md

4d957e7

nealrichardson commented Jul 11, 2022

r/R/chunked-array.R (outdated)

Update r/R/chunked-array.R

3270842

nealrichardson merged commit c6534a5 into apache:master Jul 12, 2022

nealrichardson deleted the glimpse-part1 branch July 12, 2022 19:48

dragosmg mentioned this pull request Aug 2, 2022

ARROW-17084: [R] Install the package before linting #13620

Merged

eitsupi mentioned this pull request Feb 19, 2023

GH-33631: [R] Rewrite Jira ticket numbers in pkgdown documents to GitHub issue numbers #34260

Merged
