Skip to content

Conversation

@dragosmg
Copy link
Contributor

@dragosmg dragosmg commented May 14, 2022

This PR will allow the use of namespacing with bindings:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

test_df <- tibble(
  date = as.Date(c("2022-03-22", "2021-07-30", NA))
)

test_df %>%
  mutate(ddate = lubridate::as_datetime(date)) %>%
  collect()
#> # A tibble: 3 × 2
#>   date       ddate              
#>   <date>     <dttm>             
#> 1 2022-03-22 2022-03-22 00:00:00
#> 2 2021-07-30 2021-07-30 00:00:00
#> 3 NA         NA

test_df %>%
  arrow_table() %>% 
  mutate(ddate = lubridate::as_datetime(date)) %>%
  collect()
#> # A tibble: 3 × 2
#>   date       ddate              
#>   <date>     <dttm>             
#> 1 2022-03-22 2022-03-22 00:00:00
#> 2 2021-07-30 2021-07-30 00:00:00
#> 3 NA         NA

Created on 2022-05-14 by the reprex package (v2.0.1)

The approach (option 1 from the design doc):

  • add functionality to allow binding registration with the pkg::fun() name;

    • Modify register_binding() to register 2 identical copies for each pkg::fun binding, namely fun and pkg::fun.
    • Add a binding for the :: operator, which helps with retrieving bindings from the function registry.
    • Add generic unit tests for the pkg::fun functionality.
    • Warn for a duplicated binding registration.
  • register nse_funcs requiring indirect mapping

    • register each binding with and without the pkg:: prefix.
    • add / update unit tests for the nse_funcs bindings to include at least one pkg::fun() call for each binding
    unit tests for conditional bindings
    • "dplyr::coalesce"
    • "dplyr::if_else"
    • "base::ifelse"
    • "dplyr::case_when"
    unit tests for date/time bindings
    • "base::strptime"
    • "base::strftime"
    • "lubridate::format_ISO8601"
    • "lubridate::is.Date"
    • "lubridate::is.instant"
    • "lubridate::is.timepoint"
    • "lubridate::is.POSIXct"
    • "lubridate::date"
    • "lubridate::second"
    • "lubridate::wday"
    • "lubridate::week"
    • "lubridate::month"
    • "lubridate::am"
    • "lubridate::pm"
    • "lubridate::tz"
    • "lubridate::semester"
    • "lubridate::make_datetime"
    • "lubridate::make_date"
    • "base::ISOdatetime"
    • "base::ISOdate"
    • "base::as.Date"
    • "lubridate::as_date"
    • "lubridate::as_datetime"
    • "lubridate::decimal_date"
    • "lubridate::date_decimal"
    • "base::difftime"
    • "base::as.difftime"
    • "lubridate::make_difftime"
    • "lubridate::dminutes"
    • "lubridate::dhours"
    • "lubridate::ddays"
    • "lubridate::dweeks"
    • "lubridate::dmonths"
    • "lubridate::dyears"
    • "lubridate::dseconds"
    • "lubridate::dmilliseconds"
    • "lubridate::dmicroseconds"
    • "lubridate::dnanoseconds"
    • "lubridate::dpicoseconds"
    • "lubridate::parse_date_time"
    • "lubridate::ymd"
    • "lubridate::ydm"
    • "lubridate::mdy"
    • "lubridate::myd"
    • "lubridate::dmy"
    • "lubridate::dym"
    • "lubridate::ym"
    • "lubridate::my"
    • "lubridate::yq"
    • "lubridate::fast_strptime"
    unit tests for math bindings
    • "base::log"
    • "base::logb"
    • "base::pmin"
    • "base::pmax"
    • "base::trunc"
    • "base::round"
    • "base::sqrt"
    • "base::exp"
    unit tests for string bindings
    • "base::paste"
    • "base::paste0"
    • "stringr::str_c"
    • "base::grepl"
    • "stringr::str_detect"
    • "stringr::str_like"
    • "stringr::str_count"
    • "base::startsWith"
    • "base::endsWith"
    • "stringr::str_starts"
    • "stringr::str_ends"
    • "base::sub"
    • "base::gsub"
    • "stringr::str_replace"
    • "stringr::str_replace_all"
    • "base::strsplit"
    • "stringr::str_split"
    • "base::nchar"
    • "stringr::str_to_lower"
    • "stringr::str_to_upper"
    • "stringr::str_to_title"
    • "stringr::str_trim"
    • "base::substr"
    • "base::substring"
    • "stringr::str_sub"
    • "stringr::str_pad"
    unit tests for type bindings
    • "base::as.character"
    • "base::as.double"
    • "base::as.integer"
    • "bit64::as.integer64"
    • "base::as.logical"
    • "base::as.numeric"
    • "methods::is"
    • "tibble::tibble"
    • "base::data.frame"
    • "base::is.character"
    • "base::is.numeric"
    • "base::is.double"
    • "base::is.integer"
    • "bit64::is.integer64"
    • "base::is.logical"
    • "base::is.factor"
    • "base::is.list"
    • "rlang::is_character"
    • "rlang::is_double"
    • "rlang::is_integer"
    • "rlang::is_list"
    • "rlang::is_logical"
    • "base::is.na"
    • "base::is.nan"
    • "dplyr::between"
    • "base::is.finite"
    • "base::is.infinite"
    • "base::format"
  • register nse_funcs requiring direct mapping (unary and binary bindings)

    • register unary bindings
    • register binary bindings
    • add / update unit tests for the nse_funcs bindings to include at least one pkg::fun() call for each binding
    Unary and binary bindings unit tests
    • arithmetic functions

      • "base::abs"
      • "base::ceiling"
      • "base::floor"
      • "base::log10"
      • "base::log1p"
      • "base::log2"
      • "base::sign"
    • trigonometric functions

      • "base::acos"
      • "base::asin"
      • "base::cos"
      • "base::sin"
      • "base::tan"
    • string functions

      • "stringr::str_length"
      • "stringi::stri_reverse"
      • "base::tolower"
      • "base::toupper"
    • date and time functions

      • "lubridate::day"
      • "lubridate::dst"
      • "lubridate::hour"
      • "lubridate::isoweek"
      • "lubridate::epiweek"
      • "lubridate::isoyear"
      • "lubridate::epiyear"
      • "lubridate::minute"
      • "lubridate::quarter"
      • "lubridate::mday"
      • "lubridate::yday"
      • "lubridate::year"
      • "lubridate::leap_year"
    • type conversion functions

      • "base::as.factor"
    • binary functions

      • "base::strrep"
      • "stringr::str_dup"
  • aggregating functions

    • register agg_funcs
    • add unit tests for agg_funcs
    unit tests for aggregating bindings
    • "base::sum"
    • "base::any"
    • "base::all"
    • "base::mean"
    • "stats::sd"
    • "stats::var"
    • "stats::quantile"
    • "stats::median"
    • "dplyr::n_distinct"
    • "dplyr::n"
    • "base::min"
    • "base::max"
  • namespace qualified bindings work inside the {dplyr} action verbs:

    • filter()
    • mutate()
    • transmute()
    • group_by()
    • summarise()
  • document changes in the Writing bindings article.

    • going forward we should be using pkg::fun when defining a binding, which will register 2 copies of the same binding.

Bindings that will not be registered with a pkg:: prefix:

  • type casting, such as cast() or dictionary_encode(), and
  • operators (e.g. "!", "==", "!=", ">", ">=", "<", "<=", "&", etc.)

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few thoughts on the approach! This will be very exciting to have implemented.

@dragosmg dragosmg changed the title ARROW-14575: [R] Allow functions with {{pkg::}} prefixes ARROW-14575: [R] Allow functions with pkg:: prefixes May 17, 2022
@dragosmg
Copy link
Contributor Author

dragosmg commented May 17, 2022

It sounds like having a defused version of the :: operator is the way to go (instead of syntax translation). I worked on a proposal for a :: binding yesterday and will push it soon.

Another design choice would be the location for the bindings. Do we want?

  • a single bucket (i.e. nse_funcs) to hold all of them?
  • individual bucket per namespace: nse_funcs[[package]]
  • a mixed approach:
    • a general bucket: nse_funcs[[all_function]] where all bindings live (in their namespace-unqualified form - i.e. as_datetime()
    • package specific buckets: nse_funcs[[package]]

If someone wants to register a pkg::fun binding, it will be recorded in 2 environments, both as nse_funcs[[pkg]][[fun]] and as nse_funcs[[fun]]. If they register fun, it will only be found in use_funcs[[fun]]. We could also add some logic (inspired by Dewey comments on the Jira ticket) to extract the likely namespace from the attached namespaces.

I am in favour of the mixed single-bucket approach , which would make the :: binding something like.

arrow:::register_binding("::", function(lhs, rhs) {
  lhs_name <- as.character(substitute(lhs))
  rhs_name <- as.character(substitute(rhs))
  arrow:::nse_funcs[[lhs_name]][[rhs_name]]
})

@thisisnic
Copy link
Member

Given that currently we use a "single-bucket" approach, I'd keep it that way in the short term, unless there's a specific reason to change it? We can update it later if necessary.

@dragosmg
Copy link
Contributor Author

dragosmg commented May 17, 2022

Given that currently we use a "single-bucket" approach, I'd keep it that way in the short term, unless there's a specific reason to change it? We can update it later if necessary.

Yaya, that's what I'm working on. I went for the mixed bucket approach yesterday evening and earlier today and got really messy really quickly.

@paleolimbot
Copy link
Member

I quite liked your suggestion of the single bucket but registering twice (i.e., do nse_funcs[["fun_name"]] <- fun but also nse_funcs[["pkgname::fun_name"]] <- fun if pkgname exists. There's less bookkeeping that way that could possibly get mucked up!

@paleolimbot
Copy link
Member

(Since starting to write that note I now see that you both have already hashed that out!)

@dragosmg
Copy link
Contributor Author

dragosmg commented May 17, 2022

@paleolimbot @thisisnic @jonkeane Would you mind having another look? I can't request a review from @paleolimbot

@dragosmg dragosmg requested review from jonkeane and thisisnic May 17, 2022 15:34
@dragosmg dragosmg marked this pull request as ready for review May 17, 2022 15:45
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good...I imagine the next step is to add the function prefixes (including base::) to all the regsiter_binding() calls and add some tests (maybe one per package?) to make sure that when somebody does call lubridate::as_datetime() that it works as we expect.

@dragosmg
Copy link
Contributor Author

Looking good...I imagine the next step is to add the function prefixes (including base::) to all the regsiter_binding() calls and add some tests (maybe one per package?) to make sure that when somebody does call lubridate::as_datetime() that it works as we expect.

Yep, that's the plan. There are some tests failing - I need to deal with that too.

@dragosmg dragosmg marked this pull request as draft May 18, 2022 11:52
@dragosmg
Copy link
Contributor Author

dragosmg commented May 20, 2022

outstanding issues:

# we can't (for now) use namespacing, so we need to make sure lubridate::date()
# and not base::date() is being used. This is due to the way testthat runs and
# normal use of arrow would not have to do this explicitly.
# TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
date <- lubridate::date

  • prefixing doesn't work now works for summarising / aggregating functions too

@dragosmg
Copy link
Contributor Author

I have no issue with either lubridate::date or date calls in interactive sessions. I think failures for the un-prefixed form are due to the way testthat works, so we shouldn't be concerned with those. Am I wrong?

@paleolimbot
Copy link
Member

I think you're safe to make a user prefix lubridate::date() if they do run into this error (and you're also safe to do this in the test, since it's what I'd expect a user to do in the case of an ambiguous function name).

@dragosmg dragosmg force-pushed the allow_namespacing branch from a49eda5 to 8944f5b Compare July 4, 2022 14:18
@dragosmg
Copy link
Contributor Author

dragosmg commented Jul 5, 2022

I think this is ready for another look, with the following caveats:

@dragosmg dragosmg marked this pull request as ready for review July 5, 2022 10:32
@dragosmg dragosmg requested a review from paleolimbot July 5, 2022 10:32
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few next steps for you...getting this to work in summarise() will be a set of changes with a different scope because it does its own inspection of the syntax tree. I think you should open another JIRA for it and implement it in a separate PR. I think the filter() issue you identified is actually a general issue and I think one of the review comments I left will fix it.

@dragosmg dragosmg force-pushed the allow_namespacing branch from 81aa495 to 642c23e Compare July 15, 2022 12:31
@dragosmg
Copy link
Contributor Author

dragosmg commented Jul 15, 2022

@nealrichardson I think I addressed all the feedback. I'm hoping I can maybe run the CI over the weekend if the r-hub image gets updated.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! CI failures are unrelated, and since this fixes the lint issue that has appeared elsewhere, I'm inclined to merge sooner than later.

@nealrichardson
Copy link
Member

@github-actions crossbow submit test-r-minimal-build test-r-versions

@github-actions
Copy link

Revision: ac316ee

Submitted crossbow builds: ursacomputing/crossbow @ actions-6a3852ff25

Task Status
test-r-minimal-build Azure
test-r-versions Github Actions

@nealrichardson nealrichardson merged commit 3e0eea1 into apache:master Jul 15, 2022
@ursabot
Copy link

ursabot commented Jul 15, 2022

Benchmark runs are scheduled for baseline = 29cc263 and contender = 3e0eea1. 3e0eea1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.38% ⬆️0.03%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3e0eea12 ec2-t3-xlarge-us-east-2
[Failed] 3e0eea12 test-mac-arm
[Failed] 3e0eea12 ursa-i9-9960x
[Finished] 3e0eea12 ursa-thinkcentre-m75q
[Finished] 29cc2630 ec2-t3-xlarge-us-east-2
[Failed] 29cc2630 test-mac-arm
[Failed] 29cc2630 ursa-i9-9960x
[Finished] 29cc2630 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

kou pushed a commit that referenced this pull request Feb 20, 2023
…Hub issue numbers (#34260)

Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature.

Issue numbers have been rewritten based on the following correspondence.
Also, the pkgdown settings have been changed and updated to link to GitHub.

I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly.

---
ARROW-6338	#5198
ARROW-6364	#5201
ARROW-6323	#5169
ARROW-6278	#5141
ARROW-6360	#5329
ARROW-6533	#5450
ARROW-6348	#5223
ARROW-6337	#5399
ARROW-10850	#9128
ARROW-10624	#9092
ARROW-10386	#8549
ARROW-6994	#23308
ARROW-12774	#10320
ARROW-12670	#10287
ARROW-16828	#13484
ARROW-14989	#13482
ARROW-16977	#13514
ARROW-13404	#10999
ARROW-16887	#13601
ARROW-15906	#13206
ARROW-15280	#13171
ARROW-16144	#13183
ARROW-16511	#13105
ARROW-16085	#13088
ARROW-16715	#13555
ARROW-16268	#13550
ARROW-16700	#13518
ARROW-16807	#13583
ARROW-16871	#13517
ARROW-16415	#13190
ARROW-14821	#12154
ARROW-16439	#13174
ARROW-16394	#13118
ARROW-16516	#13163
ARROW-16395	#13627
ARROW-14848	#12589
ARROW-16407	#13196
ARROW-16653	#13506
ARROW-14575	#13160
ARROW-15271	#13170
ARROW-16703	#13650
ARROW-16444	#13397
ARROW-15016	#13541
ARROW-16776	#13563
ARROW-15622	#13090
ARROW-18131	#14484
ARROW-18305	#14581
ARROW-18285	#14615
* Closes: #33631

Authored-by: SHIMA Tatsuya <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants