@wjones127
Member

@wjones127 wjones127 commented Jul 11, 2022

BREAKING CHANGE: We had several different behaviors when passing in file paths with trailing slashes: LocalFileSystem would return IOError, S3 would trim off the trailing slash, and GCS would keep the trailing slash as part of the file name (later creating confusion as the file would be labelled a "directory" in list calls). This PR moves them all to the behavior of LocalFileSystem: return IOError.

The R filesystem bindings relied on the behavior provided by S3, since FileSystem$path() returns a SubTreeFileSystem as a convenient way to bundle a path and filesystem object, and SubTreeFileSystem$base_path adds a trailing slash. To adapt to the C++ changes, the functions that accept a SubTreeFileSystem as a path to a file are now modified to trim the trailing slash before passing the path down to C++.

Here is an example of the differences in behavior between S3 and GCS:

import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
    endpoint_override="localhost:9001",
    scheme="http",
    anonymous=True,
    retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")

# Writing to test.txt with and without slash produces a file and a directory!?
with gcs.open_output_stream("py_test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with gcs.open_output_stream("py_test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")
gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt': type=FileType.Directory>]

s3 = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    endpoint_override="localhost:9000",
    allow_bucket_creation=True,
    allow_bucket_deletion=True,
)

s3.create_dir("py-test")

# Writing to test.txt with and without slash writes to same file
with s3.open_output_stream("py-test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")
s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
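
As a rough illustration, the three policies described in this PR (trim, keep, reject) can be modeled in plain Python without any Arrow dependency. The function names below are hypothetical and only mirror the description above; they are not part of pyarrow, and OSError is used as a stand-in for Arrow's IOError.

```python
def trim_trailing_slash(path: str) -> str:
    """Model of the old S3 behavior: drop one trailing slash."""
    return path[:-1] if path.endswith("/") else path


def keep_trailing_slash(path: str) -> str:
    """Model of the old GCS behavior: the slash stays in the object name."""
    return path


def reject_trailing_slash(path: str) -> str:
    """Model of the unified behavior: a file path ending in '/' is an error."""
    if path.endswith("/"):
        # OSError stands in for Arrow's IOError in this sketch.
        raise OSError(f"Not a regular file: '{path}'")
    return path


# Apply each policy to the same file path with a trailing slash.
for policy in (trim_trailing_slash, keep_trailing_slash, reject_trailing_slash):
    try:
        print(policy.__name__, "->", repr(policy("py-test/test.txt/")))
    except OSError as exc:
        print(policy.__name__, "-> OSError:", exc)
```

Only the last policy surfaces the mistake to the caller; the other two either hide it or store it in the object name.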

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127
Member Author

cc @emkornfield @pitrou @coryan Does this behavior seem reasonable to you?

@emkornfield
Contributor

Seems reasonable to me. Is it possible to test this against a generated dataset to make sure nothing breaks there?

@wjones127
Member Author

> Seems reasonable to me. Is it possible to test this against a generated dataset to make sure nothing breaks there?

Do these tests from R seem sufficient? https://github.com/apache/arrow/blob/81af2912b5e088c83631df4cfd52e6d6df2878dc/r/tests/testthat/helper-filesystems.R#L130-L156

I pulled the changes from this PR into #13542, so the above tests will run in CI there. (I can confirm they pass locally for GCS and S3.)

@wjones127 wjones127 marked this pull request as ready for review July 12, 2022 04:55
@jorisvandenbossche
Member

Just wondering: to what extent is this a corner case that just happens to be in our tests, or is there a practical value in allowing trailing slashes?

For example, LocalFileSystem also does not allow trailing slashes in a similar example:

In [51]: from pyarrow.fs import LocalFileSystem

In [52]: local = LocalFileSystem()

In [53]: local.create_dir("py-test")

In [54]: with local.open_output_stream("py-test/test.txt/") as out_stream:
    ...:     out_stream.write(b"Hello world!")
...
IsADirectoryError: [Errno 21] Failed to open local file 'py-test/test.txt/'
Detail: [errno 21] Is a directory

(That error makes sense to me; Python's open will also complain that it is a directory and not a file in the equivalent example.)
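
For comparison, here is a minimal sketch of the equivalent check with Python's built-in open, using a temporary directory so that it leaves nothing behind. The helper name is hypothetical. The concrete exception subclass and errno are platform-dependent, so the sketch only catches the common base class OSError.

```python
import os
import tempfile
from typing import Optional


def try_write_with_trailing_slash() -> Optional[OSError]:
    """Try to write to 'test.txt/' and return the error raised, if any."""
    with tempfile.TemporaryDirectory() as tmp:
        # Same shape as the example above: a file path with a trailing slash.
        target = os.path.join(tmp, "test.txt") + "/"
        try:
            with open(target, "wb") as out_stream:
                out_stream.write(b"Hello world!")
        except OSError as exc:
            return exc
    return None


result = try_write_with_trailing_slash()
# Print the exception class name, or 'NoneType' if the write succeeded.
print(type(result).__name__)
```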

@pitrou
Member

pitrou commented Jul 12, 2022

I agree it seems better to disallow trailing slashes in filenames rather than silently dropping them.

@wjones127
Member Author

> Just wondering: to what extent is this a corner case that just happens to be in our tests, or is there a practical value in allowing trailing slashes?

I'm not sure yet what else relies on that. In R, it's just that we pass fs$path("some/path") to functions like write_parquet(). fs$path() returns a SubTreeFileSystem, which I think is just used as a convenient way to pass a path (as the base path) and a filesystem in one object. But as a side effect of using base_path to transmit the path, we add a trailing slash. This approach fails for LocalFileSystem, though:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

example_data <- arrow_table(x = Array$create(c(1, 2, 3)))

fs <- LocalFileSystem$create()
write_parquet(example_data, fs$path("test.parquet"))
#> Error: IOError: Failed to open local file 'test.parquet/'
#> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/filesystem/localfs.cc:442  ::arrow::internal::FileOpenWritable(fn, write_only, truncate, append). Detail: [errno 2] No such file or directory

Created on 2022-07-12 by the reprex package (v2.0.1)

cc @nealrichardson in case he has any input.

I will see about changing GCS to reject file paths that end with slashes. Should I leave S3 alone? Or should it also reject them?

@pitrou
Member

pitrou commented Jul 12, 2022

Hmm, SubTreeFileSystem is meant to take a subdirectory parameter; you are not expected to use it to pass a filename.

@pitrou
Member

pitrou commented Jul 12, 2022

> Should I leave S3 alone? Or should it also reject them?

It would be fine with me to reject them in S3 as well, but if you hit too many regressions then no need to sweat over it either :-)

@nealrichardson
Member

> Hmm, SubTreeFileSystem is meant to take a subdirectory parameter; you are not expected to use it to pass a filename.

Historical context: ARROW-10254, which points to #8351 (comment)

We could have the single-file writers (perhaps the readers too) in R prune a possible trailing slash from filenames, if this is the only source of the issue. That logic is pretty well encapsulated in make_readable_file() and make_output_stream() so it should be feasible to do. Shouldn't be a concern for write_dataset/open_dataset since they point at directories.
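
A minimal sketch of that idea, written in Python to match the examples earlier in this thread rather than in R. `PathBundle` and `split_file_target` are hypothetical names standing in for SubTreeFileSystem and for the unpacking done inside make_readable_file() and make_output_stream(); they are not Arrow APIs, and the always-append-a-slash behavior is modeled from the description in this thread.

```python
import re
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class PathBundle:
    """Hypothetical stand-in for SubTreeFileSystem: a filesystem plus a base path."""

    base_fs: Any
    base_path: str

    def __post_init__(self) -> None:
        # Model the side effect discussed here: base_path always ends with a slash.
        if not self.base_path.endswith("/"):
            self.base_path += "/"


def split_file_target(bundle: PathBundle) -> Tuple[Any, str]:
    """Return (filesystem, file path) with one trailing slash pruned."""
    # Python counterpart of R's sub("/$", "", base_path).
    return bundle.base_fs, re.sub(r"/$", "", bundle.base_path)


# The string "s3" is just a placeholder for a real filesystem object.
fs, path = split_file_target(PathBundle(base_fs="s3", base_path="my/path/to/file.parquet"))
print(path)  # my/path/to/file.parquet
```

Pruning happens only where a bundle is interpreted as a single file, so directory-oriented callers are unaffected.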

Comment on lines 143 to 145
Status AssertNoTrailingSlash(const std::string& key) {
  if (key.back() == '/') {
    return NotAFile(key);
Member Author

I would take util::string_view instead but NotAFile only accepts const std::string&. Would it be alright if I moved "arrow/filesystem/util_internal.h" to use util::string_view?

Member

Definitely ok!

@wjones127 wjones127 changed the title from "ARROW-17045: [C++] Ignore trailing slashes on files in GCS" to "ARROW-17045: [C++] Reject trailing slashes on file path" Jul 12, 2022
file <- file$base_path
# SubTreeFileSystem adds a slash to base_path, but filesystems will reject file names
# with trailing slashes, so we need to remove it here.
file <- sub("/$", "", file$base_path)
Member

I'm probably misreading this, but is this treating any SubTreeFileSystem as pointing to a local file path?

Member Author

By "filesystems" I meant the Arrow FileSystem classes, not the local file system. Does that clarify your confusion? Or are you asking something else?

Member

No, I am asking something else. It seems that this code is replacing file (a FileSystem instance) with a file path, is that right? And below, the file path file will be treated as a local filesystem path?

Member Author

@wjones127 wjones127 Jul 13, 2022

Oh I see. Yeah I just blindly updated this without thinking about what happens downstream 🤦

Below, it will go through the code path where is.string(file) and !is.null(filesystem) are both TRUE, so it will later call file <- filesystem$OpenInputFile(file). So it went from a SubTreeFileSystem to a CharacterVector to a <whatever OpenInputFile returns>. Wow, that is some very dynamic typing 😵

Member Author

But the path should still be resolved against the correct filesystem, since we extracted the filesystem on the line before.

Member

Oh, I see, I had missed the filesystem <- file$base_fs. So I guess my last question is: why are we calling make_readable_file with a SubTreeFileSystem as the first argument?

Member Author

The expectation is that users will call FileSystem$path() to specify the location to read from or write to:

fs <- S3FileSystem$create()
write_parquet(my_tab, fs$path("my/path/to"))

fs$path returns a SubTreeFileSystem as a convenient way to bundle a path and filesystem object in a single object. See discussion: #8351 (comment)

Member

Ah. That's unfortunate, but that's not this PR's business. Thanks for the details!

Member

@nealrichardson nealrichardson left a comment

R side LGTM, thanks!

Member

@pitrou pitrou left a comment

+1 from me, this is a reasonable straightening of the API. Thank you @wjones127 for doing this!

@pitrou pitrou merged commit 0024962 into apache:master Jul 13, 2022
@wjones127 wjones127 deleted the ARROW-17045-gcs-slashes branch July 13, 2022 17:12
@jorisvandenbossche
Member

There are some HDFS failures that might be related to this change. See e.g. https://github.com/ursacomputing/crossbow/runs/7331664487?check_suite_focus=true

@wjones127
Member Author

Oh no! Those do look related @jorisvandenbossche. Is HDFS not in our usual set of CI tests?

@ursabot

ursabot commented Jul 14, 2022

Benchmark runs are scheduled for baseline = 03e80dc and contender = 0024962. 0024962 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.75% ⬆️0.1%] test-mac-arm
[Failed ⬇️0.57% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️2.32% ⬆️0.11%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0024962f ec2-t3-xlarge-us-east-2
[Finished] 0024962f test-mac-arm
[Failed] 0024962f ursa-i9-9960x
[Finished] 0024962f ursa-thinkcentre-m75q
[Finished] 03e80dc1 ec2-t3-xlarge-us-east-2
[Finished] 03e80dc1 test-mac-arm
[Failed] 03e80dc1 ursa-i9-9960x
[Finished] 03e80dc1 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@pitrou
Member

pitrou commented Jul 14, 2022

@wjones127 They are in the nightly builds but not in the PR checks. Look for hdfs in archery docker images and you'll probably be able to reproduce locally :-)
