Contributor

@dragosmg dragosmg commented May 17, 2022

This PR adds:

  • a binding emulating lubridate's fast_strptime functionality

This PR does not support the following arguments:

  • lt = TRUE - this returns a POSIXlt object (a list), which cannot easily be used in a dplyr pipeline (it currently errors), so the Arrow binding uses a different default, lt = FALSE, and
  • cutoff_2000 = 68L - for the %y format, two-digit numbers smaller than or equal to cutoff_2000 are parsed as though starting with 20, otherwise as though starting with 19. It would be nice to have this, so I raised ARROW-16596. We can always suggest that users manipulate the strings before parsing, so I don't think this is crucial functionality.
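
As a rough illustration of the "manipulate the strings before parsing" workaround, here is a minimal base R sketch. The helper name expand_short_year is hypothetical (it is not part of arrow or lubridate), and it assumes every string starts with a two-digit year:

```r
# Hypothetical helper, not part of arrow or lubridate: expand a leading
# two-digit year to four digits using the cutoff rule described above.
# Assumes each string starts with exactly two year digits.
expand_short_year <- function(x, cutoff_2000 = 68L) {
  yy <- suppressWarnings(as.integer(substr(x, 1, 2)))
  # Years <= cutoff_2000 get the prefix "20", all others "19"
  century <- ifelse(yy <= cutoff_2000, "20", "19")
  # Propagate NA for missing or non-numeric input
  ifelse(is.na(yy), NA_character_, paste0(century, x))
}

x <- c("68-05-17", "69-05-17", "55-05-17")

expand_short_year(x)
#> [1] "2068-05-17" "1969-05-17" "2055-05-17"

expand_short_year(x, cutoff_2000 = 50L)
#> [1] "1968-05-17" "1969-05-17" "1955-05-17"
```

The expanded strings can then be parsed with a four-digit "%Y-%m-%d" format, which sidesteps the cutoff question entirely.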

The following code will be possible once the PR is merged:

library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

dates_table <- tibble(
  string_with_short_year = c("68-05-17", "69-05-17", "55-05-17")
)

dates_table %>% 
  mutate(
    date = fast_strptime(
      string_with_short_year, 
      format = "%y-%m-%d",
      lt = FALSE
    )
  )
#> # A tibble: 3 × 2
#>   string_with_short_year date               
#>   <chr>                  <dttm>             
#> 1 68-05-17               2068-05-17 00:00:00
#> 2 69-05-17               1969-05-17 00:00:00
#> 3 55-05-17               2055-05-17 00:00:00

dates_table %>% 
  arrow_table() %>% 
  mutate(
    date = fast_strptime(
      string_with_short_year, 
      format = "%y-%m-%d",
      lt = FALSE
    )
  ) %>%
  collect()
#> # A tibble: 3 × 2
#>   string_with_short_year date               
#>   <chr>                  <dttm>             
#> 1 68-05-17               2068-05-17 00:00:00
#> 2 69-05-17               1969-05-17 00:00:00
#> 3 55-05-17               2055-05-17 00:00:00

Created on 2022-05-18 by the reprex package (v2.0.1)

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@dragosmg dragosmg marked this pull request as ready for review May 17, 2022 20:25
Member

@thisisnic thisisnic left a comment

Looks good to me!

@dragosmg dragosmg changed the title ARROW-16439: [R] Implement bindings for lubridate's fast_strptime ARROW-16439: [R] Implement bindings for lubridate's fast_strptime May 18, 2022
@dragosmg dragosmg changed the title ARROW-16439: [R] Implement bindings for lubridate's fast_strptime ARROW-16439: [R] Implement binding for lubridate::fast_strptime May 18, 2022
@dragosmg dragosmg requested a review from jonkeane May 19, 2022 08:41
Contributor Author

dragosmg commented May 19, 2022

There is a slight problem, though, in how the fast_strptime binding parses a string like "68-12-12 12:34:56" when the format is "%Y..." (incorrect) rather than "%y..." (correct).
Base R and lubridate behave as follows. Currently the fast_strptime binding behaves like base R: it doesn't error, but parses a short year xy as 00xy.

b <- "68-10-07 19:04:0"
strptime(b, format = "%Y-%m-%d %H:%M:%S")
#> [1] "0068-10-07 19:04:00 LMT"
strptime(b, format = "%y-%m-%d %H:%M:%S")
#> [1] "2068-10-07 19:04:00 BST"

lubridate::fast_strptime(b, format = "%Y-%m-%d %H:%M:%S")
#> [1] NA
lubridate::fast_strptime(b, format = "%y-%m-%d %H:%M:%S")
#> [1] "2068-10-07 19:04:00 UTC"

Created on 2022-05-19 by the reprex package (v2.0.1)

This could trip users up (mostly when they pass multiple formats):

library(arrow, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

dates_table <- tibble(
  string_with_short_year = c("68-05-17", "69-05-17", "55-05-17")
)

dates_table %>% 
  mutate(
    date = fast_strptime(
      string_with_short_year, 
      format = c("%Y-%m-%d", "%y-%m-%d"),
      lt = TRUE
    )
  )
#> # A tibble: 3 × 2
#>   string_with_short_year date               
#>   <chr>                  <dttm>             
#> 1 68-05-17               2068-05-17 00:00:00
#> 2 69-05-17               1969-05-17 00:00:00
#> 3 55-05-17               2055-05-17 00:00:00

dates_table %>% 
  arrow_table() %>% 
  mutate(
    date = fast_strptime(
      string_with_short_year, 
      format = c("%Y-%m-%d", "%y-%m-%d"),
      lt = FALSE
    )
  ) %>% 
  collect()
#> # A tibble: 3 × 2
#>   string_with_short_year date               
#>   <chr>                  <dttm>             
#> 1 68-05-17               0068-05-17 00:00:00
#> 2 69-05-17               0069-05-17 00:00:00
#> 3 55-05-17               0055-05-17 00:00:00

Created on 2022-05-19 by the reprex package (v2.0.1)

What happens here 👆🏻 is that lubridate::fast_strptime() fails on the first format and moves on to the second, while the arrow binding doesn't.

I flagged this as part of ARROW-16596. Not quite sure how to go about handling that only on the R side though. arrow's strptime defers to the platform's strptime so we might see different behaviours on different platforms too.
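
To make the fall-through behaviour concrete, here is a small base R sketch of "try each format in order, keep the first successful parse". It is an illustration only, not how lubridate or the arrow binding is implemented: parse_first_match is a hypothetical helper, and the four-digit check for "%Y" is an assumed approximation of lubridate's stricter matching for this simple case.

```r
# Hypothetical helper: try formats in order and keep the first non-NA parse.
# A format starting with "%Y" is only attempted on strings that begin with
# four digits, so short years fall through to the next format.
parse_first_match <- function(x, formats, tz = "UTC") {
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (startsWith(fmt, "%Y")) {
      todo <- todo & grepl("^[0-9]{4}", x)
    }
    if (any(todo)) {
      out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
    }
  }
  out
}

res <- parse_first_match(
  c("68-05-17", "1969-05-17"),
  formats = c("%Y-%m-%d", "%y-%m-%d")
)
format(res, "%Y-%m-%d")
#> [1] "2068-05-17" "1969-05-17"
```

This runs on plain R vectors only; expressing the same guard inside an Arrow query is a separate question that this sketch does not address.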

@amol- amol- closed this in 7f31c9d May 19, 2022
Member

@jonkeane jonkeane left a comment

One last cleanup

Comment on lines +1903 to +1905
)#,
# arrow does not preserve the `tzone` attribute
# test ignore_attr = TRUE
Member

We should remove these commented lines, yeah?

Contributor Author

Yes, sorry. Forgot about those. Do I open a minor PR?

Contributor Author

Or should I just do it in one of the other PRs I have going?

Contributor Author

@dragosmg dragosmg May 19, 2022

@jonkeane I removed the tests in dragosmg@bde85ce

@ursabot

ursabot commented May 20, 2022

Benchmark runs are scheduled for baseline = 8394571 and contender = 7f31c9d. 7f31c9d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.74% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7f31c9d2 ec2-t3-xlarge-us-east-2
[Failed] 7f31c9d2 test-mac-arm
[Failed] 7f31c9d2 ursa-i9-9960x
[Finished] 7f31c9d2 ursa-thinkcentre-m75q
[Finished] 83945714 ec2-t3-xlarge-us-east-2
[Failed] 83945714 test-mac-arm
[Failed] 83945714 ursa-i9-9960x
[Finished] 83945714 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

dragosmg added a commit to dragosmg/arrow that referenced this pull request Jun 20, 2022
kou pushed a commit that referenced this pull request Feb 20, 2023
…Hub issue numbers (#34260)

Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature.

Issue numbers have been rewritten based on the following correspondence.
Also, the pkgdown settings have been changed and updated to link to GitHub.

I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly.

---
ARROW-6338	#5198
ARROW-6364	#5201
ARROW-6323	#5169
ARROW-6278	#5141
ARROW-6360	#5329
ARROW-6533	#5450
ARROW-6348	#5223
ARROW-6337	#5399
ARROW-10850	#9128
ARROW-10624	#9092
ARROW-10386	#8549
ARROW-6994	#23308
ARROW-12774	#10320
ARROW-12670	#10287
ARROW-16828	#13484
ARROW-14989	#13482
ARROW-16977	#13514
ARROW-13404	#10999
ARROW-16887	#13601
ARROW-15906	#13206
ARROW-15280	#13171
ARROW-16144	#13183
ARROW-16511	#13105
ARROW-16085	#13088
ARROW-16715	#13555
ARROW-16268	#13550
ARROW-16700	#13518
ARROW-16807	#13583
ARROW-16871	#13517
ARROW-16415	#13190
ARROW-14821	#12154
ARROW-16439	#13174
ARROW-16394	#13118
ARROW-16516	#13163
ARROW-16395	#13627
ARROW-14848	#12589
ARROW-16407	#13196
ARROW-16653	#13506
ARROW-14575	#13160
ARROW-15271	#13170
ARROW-16703	#13650
ARROW-16444	#13397
ARROW-15016	#13541
ARROW-16776	#13563
ARROW-15622	#13090
ARROW-18131	#14484
ARROW-18305	#14581
ARROW-18285	#14615
* Closes: #33631

Authored-by: SHIMA Tatsuya <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>