Skip to content

Conversation

@romainfrancois
Copy link
Contributor

elements of ... that are unnamed in record_batch(...) or table(...) are automatically spliced:

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])
# auto splicing
batch <- record_batch(df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
batch <- record_batch(!!!df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10
# auto splicing
tab <- table(df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
tab <- table(!!!df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

In particular, this gives us back record_batch(<data.frame>) and table(<data.frame>) :

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])
record_batch(df)
#> arrow::RecordBatch
table(df)
#> arrow::Table

@romainfrancois romainfrancois force-pushed the ARROW-5718/df_auto_splice branch from 663730f to 61902ab Compare June 26, 2019 12:25
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of notes. Generally looks good, might like to see a couple more tests.


R_xlen_t n = XLENGTH(lst);
std::vector<std::shared_ptr<arrow::Column>> columns(n);
int num_fields;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why the record_batch and table code isn't more shared

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be. record batch only have to handle arrays, where tables have to handle arrays, chunked arrays, columns. but I agree that the structure of the code is similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor the flatten-for loop into a small function if they're all the same.

#' @export
record_batch <- function(..., schema = NULL){
arrays <- list2(...)
# making sure there are always names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but why? IIUC this has to do with how the cpp code distinguishes things to autosplice, switching behavior on nchar(name). That's a subtlety I'll probably forget by Monday so it seems worth explaining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names(list(a = 1, b = 2))
#> [1] "a" "b"
names(list(a = 1, 2))
#> [1] "a" ""
names(list(1, 2))
#> NULL

otherwise we'd have to treat the NULL case differently internally as we did before, maybe I could do that instead.

arrays[j] = arrow::r::Array__from_vector(x, schema->field(j)->type(), false);
};

for (R_xlen_t i = 0, j = 0; j < num_fields; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you simplify the loop conditions? Make it only iterate over i and XLENGTH(lst) and use an external variable,e.g.

int out_idx = 0;
for (R_xlen_t i = 0; i < XLENGHT(lst); i++) {
  ...
  if (..) {
    for (...)
      fill_array(out_idx++, VECTOR_ELT(x_i, k), STRING_ELT(names_x_i, k));
  } else {
    fill_array(out_idx++, x_i, name_i);
  }
}
assert(out_idx == num_fields);

The loop is now invariant on j (out_idx in example) Easier to follow and review/refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus point, get rid of all of this and make a C++ iterator over lst that will behave as-if "flattened".

std::vector<std::shared_ptr<arrow::Field>> fields(n_arrays);
for (R_xlen_t i = 0; i < n_arrays; i++) {
fields[i] = std::make_shared<arrow::Field>(std::string(names[i]), arrays[i]->type());
std::vector<std::shared_ptr<arrow::Field>> fields(num_fields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like Schema::Make(const std::vector<Array>, const std::vector<string>) could be a convenient method to have globally.


R_xlen_t n = XLENGTH(lst);
std::vector<std::shared_ptr<arrow::Column>> columns(n);
int num_fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor the flatten-for loop into a small function if they're all the same.

@wesm
Copy link
Member

wesm commented Jun 27, 2019

What's the status on this? Since we're at the 0.14.0 end game we probably need to get this to merge readiness or defer to next release in the next 12-24 hours

@nealrichardson
Copy link
Member

This should go in 0.14, otherwise we'll have to facilitate a sparklyr workaround. Plus it's the right interface to expose.

IMO the feedback from François could be addressed in a followup ticket.

@fsaintjacques
Copy link
Contributor

I also think it's more important to get this in (seems like a very convenient feature), than have a simplified loop.

@nealrichardson
Copy link
Member

The appveyor failure was the r-project.org DNS outage from yesterday. I just checked out this PR locally and ran R CMD check and it was clean. Created https://issues.apache.org/jira/browse/ARROW-5761 for the followup. @fsaintjacques would you please merge?

@fsaintjacques
Copy link
Contributor

I will run the docker-compose R image locally, should be completed in 5-10 minutes and will merge if successful.

@fsaintjacques
Copy link
Contributor

docker image failed, but unrelated to this change, I'll open a ticket, but merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants