Skip to content

head() before collect() is not generating the expected ExecPlan #15188

@paleolimbot

Description

@paleolimbot

Describe the bug, including details regarding any error messages, version, and platform.

The helpful ursabot conbench integration noted some regressions after #14518. That was a rather small change that affected usage of head() that I had thought was rare; however, it appears that it actually gets used frequently which was surprising because I had assumed that sink nodes with options (rather than a record batch reader) was supposed to be used.

Some quick inspection suggests that we're generating a nested exec plan for every head() here:

collapse.arrow_dplyr_query(x)

...which means that all of these will go through a record batch reader instead of the appropriate sink node:

arrow/r/R/query-engine.R

Lines 79 to 86 in ec9a8a3

if (is_collapsed(.data)) {
# We have a nested query.
if (has_head_tail(.data$.data)) {
# head and tail are not ExecNodes; at best we can handle them via
# SinkNode, so if there are any steps done after head/tail, we need to
# evaluate the query up to then and then do a new query for the rest.
# as_record_batch_reader() will build and run an ExecPlan
node <- self$SourceNode(as_record_batch_reader(.data$.data))

Reprex:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

as_arrow_table(mtcars) |> 
  arrow:::as_adq() |> 
  head(10) |> 
  show_exec_plan()
#> Warning: The `ExecPlan` cannot be printed for a nested query.

(this query should probably not be nested, but instead be collected with the appropriate sink node options?)

Component(s)

R

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions