-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17045: [C++] Reject trailing slashes on file path #13577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
cc @emkornfield @pitrou @coryan Does this behavior seem reasonable to you? |
|
Seems reasonable to me. Is it possible to test this against a generated dataset to make sure nothing breaks there? |
Do these tests from R seem sufficient? https://github.com/apache/arrow/blob/81af2912b5e088c83631df4cfd52e6d6df2878dc/r/tests/testthat/helper-filesystems.R#L130-L156 I pulled the changes from this PR into #13542, so the above tests will run in CI there. (I can confirm they pass locally for GCS and S3.) |
|
Just wondering: to what extent is this a corner case that just happens to be in our tests, or is there a practical value in allowing trailing slashes? Because for example also the LocalFileSystem does not allow trailing slashes in a similar example: (which is an error that makes sense to me, also Python's |
|
I agree it seems better to disallow trailing slashes in filenames rather than silently dropping them. |
I'm not sure yet what else relies on that. In R, it's just that we pass library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
example_data <- arrow_table(x = Array$create(c(1, 2, 3)))
fs <- LocalFileSystem$create()
write_parquet(example_data, fs$path("test.parquet"))
#> Error: IOError: Failed to open local file 'test.parquet/'
#> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/filesystem/localfs.cc:442 ::arrow::internal::FileOpenWritable(fn, write_only, truncate, append). Detail: [errno 2] No such file or directoryCreated on 2022-07-12 by the reprex package (v2.0.1) cc @nealrichardson in case he has any input. I will see about changing GCS to reject file paths that end with slashes. Should I leave S3 alone? Or should it also reject them? |
|
Hmm, |
It would be fine with me to remove them, but if you hit too many regressions then no need to sweat over it either :-) |
Historical context: ARROW-10254, which points to #8351 (comment) We could have the single-file writers (perhaps the readers too) in R prune a possible trailing slash from filenames, if this is the only source of the issue. That logic is pretty well encapsulated in |
| Status AssertNoTrailingSlash(const std::string& key) { | ||
| if (key.back() == '/') { | ||
| return NotAFile(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would take util::string_view instead but NotAFile only accepts const std::string&. Would it be alright if I moved "arrow/filesystem/util_internal.h" to use util::string_view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely ok!
| file <- file$base_path | ||
| # SubTreeFileSystem adds a slash to base_path, but filesystems will reject file names | ||
| # with trailing slashes, so we need to remove it here. | ||
| file <- sub("/$", "", file$base_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably misreading this, but is this treating any SubTreeFileSystem as pointing to a local file path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "filesystems" I meant the Arrow FileSystem classes, not the local file system. Does that clarify your confusion? Or are you asking something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I am asking something else. It seems that this code is replacing file (a FileSystem instance) with a file path, is that right? And below, the file path file will be treated as a local filesystem path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. Yeah I just blindly updated this without thinking about what happens downstream 🤦
Below it will go through the code path where is.string(file) and !is.null(filesystem) are both TRUE, so it will later call file <- filesystem$OpenInputFile(file). So it went from SubTreeFilesystem to a CharacterVector to a <whatever OpenInputFile returns>. Wow that is some very dynamic typing 😵
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the path should be treated as correct filesystem since we extracted that out in the line before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, I had missed the filesystem <- file$base_fs. So I guess my last question is: why are we calling make_readable_file with a SubTreeFileSystem as the first argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expectation is that users will call FileSystem$path() to specify the location to write/read to:
fs <- S3FileSystem$create()
write_parquet(my_tab, fs$path("my/path/to"))fs$path returns a SubTreeFileSystem as a convenient way to bundle a path and filesystem object in a single object. See discussion: #8351 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That's unfortunate, but that's not this PR's business. Thanks for the details!
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R side LGTM, thanks!
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me, this is a reasonable straightening of the API. Thank you @wjones127 for doing this!
|
There are some HDFS failures that might be related to this change? See eg https://github.com/ursacomputing/crossbow/runs/7331664487?check_suite_focus=true |
|
Oh no! Those do look related @jorisvandenbossche. Is HDFS not in our usual set of CI tests? |
|
Benchmark runs are scheduled for baseline = 03e80dc and contender = 0024962. 0024962 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
@wjones127 They are in the nightly builds but not in the PR checks. Look for hdfs in |
…13615) Follow up to #13577 / ARROW-17045. Authored-by: Will Jones <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
BREAKING CHANGE: We had several different behaviors when passing in file paths with trailing slashes: LocalFileSystem would return IOError, S3 would trim off the trailing slash, and GCS would keep the trailing slash as part of the file name (later creating confusion as the file would be labelled a "directory" in list calls). This PR moves them all to the behavior of LocalFileSystem: return IOError.
The R filesystem bindings relied on the behavior provided by S3, since
FileSystem$path()returns aSubTreeFileSystemas a convenient way to bundle a path and filesystem object andSubTreeFileSystem$base_pathadds a trailing slash. To adapt to the C++ changes, the functions acceptingSubTreeFileSystemas a path to a file now modified to trim the trailing slash before passing down to C++.Here is an example of the differences in behavior between S3 and GCS: