GH-40343: [C++] Move S3FileSystem to the registry #41559
Conversation
cpp/src/arrow/adapters/orc/util.cc
Outdated
Why do we need this?
This is used in GetOrcMajorVersion(), so it should be #included here per IWYU. I found it because compilation failed at one point; compilation can spuriously succeed when standard headers include more definitions than the standard guarantees. For example, including only <string> or <iostream> might fully define std::stringstream rather than only forward-declaring it.
cpp/src/arrow/CMakeLists.txt
Outdated
Do we remove the linking of the main arrow library against AWSSDK_LINK_LIBRARIES when ARROW_S3_MODULE is ON?
We don't want to require the main arrow library to load the module. After the module can reproduce the full functionality of the S3FileSystem (S3ProxyOptions, for example) we can deprecate building s3 in the main library. That's not in scope for now, though
Did you have any idea on how to make the module expose the missing functionality too? I've been playing around with your PR exposing LoadFileSystemFactories to pyarrow, running a small test where I load the s3 module from pyarrow using a libarrow built without S3. I can do it, but I have some missing symbols, which I can get if I reload with ctypes.CDLL(libarrow_s3fs_path, mode=ctypes.RTLD_GLOBAL).

Basically this test works using libarrow without S3:

```python
def test_filesystem_from_uri_s3(s3_server):
    # Load libarrow_s3fs.so
    libarrow_s3fs_path = '/home/raulcd/code/libarrow_s3fs.so'
    FileSystem.load_file_system(libarrow_s3fs_path)
    import ctypes
    lib = ctypes.CDLL(libarrow_s3fs_path, mode=ctypes.RTLD_GLOBAL)
    assert lib is not None
    host, port, access_key, secret_key = s3_server['connection']
    uri = "s3://{}:{}@mybucket/foo/bar?scheme=http&endpoint_override={}:{}" \
        "&allow_bucket_creation=True" \
        .format(access_key, secret_key, host, port)
    fs, path = FileSystem.from_uri(uri)
    assert path == "mybucket/foo/bar"
    fs.create_dir(path)
    [info] = fs.get_file_info([path])
    assert info.path == path
    assert info.type == FileType.Directory
```

I am just trying to understand if you had an idea around the next steps.
> Did you have any idea on how to make the module expose the missing functionality too?

The filesystem module supports only construction of filesystems from URI strings (and thereafter we have access only to methods on the FileSystem base interface), so a brief enumeration of the functionality missing relative to pyarrow.fs.S3FileSystem is:

- constructing the filesystem with proxy_options
- retrieving the AWS region a filesystem is connected to
- providing a retry_strategy to use when communicating with S3

It is possible to support all of these through URI strings alone, with some caveats:

- Proxy options could be encoded in the s3:// URI scheme:
  - since proxy configuration already has a URI representation, it could be added as a single URI parameter like s3://u:p@mybucket/foo/bar?proxy_options=http%3A//proxy_u%3Aproxy_p@proxy_host%3A999
  - alternatively, independent parameters could be added for proxy_scheme, proxy_username, etc.
- AWS region is a URI parameter, so if we need to retrieve the region from a FileSystem we can convert it to a URI and read that parameter.
  - Region is resolved when the S3FileSystem is constructed, so it won't always be a copy of the region parameter in the original URI.
  - Python has not bound FileSystem::MakeUri yet, but after it does, access to the S3 region could be written like:

    ```python
    def get_s3_region(fs: FileSystem) -> str | None:
        if fs.type_name != "s3":
            return None
        from urllib.parse import urlparse, parse_qs
        params = parse_qs(urlparse(fs.to_uri()).query)
        return params["region"][0]
    ```

- S3RetryStrategy is currently available in C++ as a fully user-extensible interface, which is of course not specifiable in a URI parameter. However, in Python at least, this is only exposed as the choice between the standard or default strategies and a maximum number of retries. This more limited enumeration of retry strategies can be specified in a parameter like s3://u:p@mybucket/foo/bar?retry_strategy=standard/3
  - Full configuration of the retry strategy interface seems like a feature which might be little used, but a survey on the ML would be a good idea. If no significant use cases arise, there can be a deprecation cycle for public S3RetryStrategy.
Once the above changes are made, the full functionality of pyarrow.fs.S3FileSystem can be recovered using just libarrow_s3fs. At that point, the option to build libarrow directly linked to the AWS SDK will be redundant and can be removed.
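As a rough sketch of the single-parameter encoding mentioned above (here `proxy_options` is a hypothetical URI parameter, not one Arrow currently recognizes), percent-encoding lets the whole proxy URI survive as one query value:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical single-parameter encoding of proxy options; "proxy_options"
# is not an existing Arrow URI parameter.
proxy_uri = "http://proxy_u:proxy_p@proxy_host:999"
query = urlencode({"proxy_options": proxy_uri, "region": "us-east-1"})
s3_uri = f"s3://u:p@mybucket/foo/bar?{query}"

# A consumer recovers the proxy configuration by parsing the query string;
# parse_qs undoes the percent-encoding applied by urlencode.
params = parse_qs(urlparse(s3_uri).query)
assert params["proxy_options"][0] == proxy_uri
```

The round-trip through `urlencode`/`parse_qs` is what makes embedding one URI inside another safe: the reserved characters (`:`, `/`, `@`) in the proxy URI are escaped and cannot be confused with the delimiters of the outer S3 URI.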
> but I have some missing symbols

I don't know why that would be; LoadFileSystemFactories loads symbols locally, but the only symbols in libarrow_s3fs would be:

- arrow_filesystem_get_registry
- symbols from the AWS libs
- symbols from libarrow

... none of which I'd expect to be needed by subsequently loaded shared objects. What symbols are missing? If the AWS libs are "subsequently loaded", then I guess LoadFileSystemFactories will need to load globally as in your python snippet.
Of course, another way to handle extended features of particular FileSystems is to add extra functions to its module. For example, let's say that reading the resolved S3 region from a URI parameter as I described above did not seem natural. Instead, we could add to s3fs_module.cc

```cpp
extern "C" {
ARROW_FORCE_EXPORT char* arrow_filesystem_get_s3_region(const void* fs);
}
```

... then in python that function could be retrieved with CDLL.__getattr__.

I recommend avoiding this extra-C-functions approach if possible since it introduces a de-facto ABI that would be a well hidden sharp edge in the library. For resolving the S3 region this could be as simple as "the caller is responsible for freeing the returned string if it's not null". However, if we consider the more complex example of a fully user-programmable retry strategy, extra-C-functions would involve:

- negotiating the lifetime (both allocation/deallocation and construction/destruction) of retry strategy objects
- documenting the retry strategy interface as ABI stable so that no developers modify it (which would break an older libarrow_s3fs loaded by a newer libarrow)
- probably asserting that the major version of libarrow_${any}fs matches the major version of the libarrow which is loading it
- possibly also rewriting the retry strategy interface as a bundle of function pointers, as we do with other ABI objects
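To illustrate what "a bundle of function pointers" means here, a hedged sketch mirroring such a C ABI with Python's ctypes (none of these names exist in Arrow; this is purely illustrative):

```python
import ctypes

# A retry strategy exposed across a C ABI as a plain struct of function
# pointers, which stays ABI-stable in a way a C++ virtual interface is not.
ShouldRetryFn = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_int64)   # (attempt) -> bool
BackoffMsFn = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64)    # (attempt) -> delay ms

class CRetryStrategy(ctypes.Structure):
    _fields_ = [
        ("should_retry", ShouldRetryFn),
        ("backoff_ms", BackoffMsFn),
    ]

# Keep the callback wrappers referenced so ctypes does not collect them.
should_retry_cb = ShouldRetryFn(lambda attempt: attempt < 3)
backoff_cb = BackoffMsFn(lambda attempt: 100 * (2 ** attempt))
strategy = CRetryStrategy(should_retry_cb, backoff_cb)

assert strategy.should_retry(2)
assert not strategy.should_retry(3)
assert strategy.backoff_ms(2) == 400  # exponential backoff: 100 * 2^2
```

The cost named above is visible even in this toy: every behavior must be flattened into a fixed set of C-callable entry points, and any change to the struct layout is an ABI break.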
After this merges, I will open a draft PR filling out the missing points of S3 support I described above:

- add proxy parameters to the S3 URL format; proxy parameters will be namespaced like proxy.username
- expose MakeUri in python and use it to extract the S3 region
- add a retry strategy parameter to the S3 URL format
- deprecate the retry strategy interface

I think having a concrete patch to discuss will help focus discussion here and on the ML.
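For instance, namespaced parameters like the proposed proxy.username could be recovered from the query string with standard-library parsing (a sketch of the proposal, not existing Arrow behavior):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical namespaced proxy parameters, per the plan above.
uri = "s3://mybucket/foo?proxy.scheme=http&proxy.username=u&proxy.port=999&region=us-east-1"
params = parse_qs(urlparse(uri).query)

# Collect everything under the "proxy." namespace into its own dict,
# leaving unrelated parameters (like region) untouched.
proxy = {k.split(".", 1)[1]: v[0] for k, v in params.items() if k.startswith("proxy.")}
assert proxy == {"scheme": "http", "username": "u", "port": "999"}
```

The namespace prefix keeps proxy settings from colliding with existing S3 URI parameters while still fitting in a flat query string.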
cpp/src/arrow/filesystem/s3fs.cc
Outdated
Should these be in URIs? Has this always been the case?
This has been in place since S3 URIs were supported, #6403
The URI is intended to be a fully self contained initializer for a filesystem, so if the filesystem requires secrets for initialization then the URI must contain them
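For illustration with standard-library parsing only (not Arrow's actual URI handling), the embedded credentials are plainly recoverable from such a URI:

```python
from urllib.parse import urlparse

# The secret is part of the URI string itself: anyone holding the
# string holds the credential. Values here are placeholders.
uri = "s3://my-access-key:my-secret-key@mybucket/foo/bar"
parsed = urlparse(uri)
assert parsed.username == "my-access-key"
assert parsed.password == "my-secret-key"
assert parsed.hostname == "mybucket"
```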
Is this still recommended practice? It seems Azure recommends not doing this to prevent accidental exposure of credentials in logs etc. https://lists.apache.org/thread/mhwr2lvtxvjcqos12k7hr4cqkdofrxxo
I don't know anything about S3 auth though so 🤷 (in any case probably something for a follow up)
@bkietz Could you start a discussion for this on dev@ like the Azure one just in case?
https://lists.apache.org/thread/mhwr2lvtxvjcqos12k7hr4cqkdofrxxo
If we should not do this, we can remove this logic as a follow-up task.
I can start an ML discussion, but: in arrow::fs::, URIs are intended to serve as complete specifications of a filesystem. If a filesystem cannot be constructed without a secret, then either 1. the URI contains that secret, or 2. that filesystem cannot be constructed from a URI.

- In the current design, filesystems simply are secrets, transitively; anyone who has the URI has access to the filesystem to which it refers.
  - This should definitely be documented more rigorously
  - It'd probably be a good idea for MakeUri to return a SecureString too, when that's merged
- If URIs-which-are-secrets is unacceptable, we would need to extend the FileSystemFactory interface to support out-of-band secrets somehow.
  - For example, we could provide a SecretRegistry to which new secrets may be added like GetSecretRegistry()->AddSecret({.key = "my-s3-secret-key", .secret = "sw0rdf1sh"}); then reference those from URIs like s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}
  - To me this indirection seems error prone and the security increment seems minimal
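A minimal sketch of the hypothetical SecretRegistry indirection described above (all names are illustrative; nothing like this exists in Arrow today):

```python
import re

# Purely illustrative: a registry mapping placeholder keys to secrets,
# resolved against a URI just before filesystem construction.
class SecretRegistry:
    def __init__(self):
        self._secrets = {}

    def add_secret(self, key, secret):
        self._secrets[key] = secret

    def resolve(self, uri):
        # Substitute each {key} placeholder with its registered secret.
        return re.sub(r"\{([^{}]+)\}", lambda m: self._secrets[m.group(1)], uri)

registry = SecretRegistry()
registry.add_secret("my-s3-key", "AKIAEXAMPLE")       # placeholder values
registry.add_secret("my-s3-secret-key", "sw0rdf1sh")

uri = "s3://{my-s3-key}:{my-s3-secret-key}@mybucket/data"
assert registry.resolve(uri) == "s3://AKIAEXAMPLE:sw0rdf1sh@mybucket/data"
```

Note that the resolved URI is exactly as sensitive as before; the indirection only keeps the secret out of the URI that is stored or logged, which is the limited security increment the comment above questions.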
Wow! I feel that it's a good summary for the discussion. :-)
Depending on the result of the discussion, we may need to change the Azure filesystem implementation too.
> @pitrou do you want to merge this one?

Yes, we probably want to revive this at some point.
raulcd
left a comment
This is great, and I would love to push it forward to be merged so we can continue working on it. To be honest, I think this is mergeable as is; we can continue working from it. I just have a couple of minor questions. Maybe we could also rebase.
@bkietz I can rebase (and maybe enable this on CI) if you don't plan to continue working on this.
```
ARROW_ORC=ON \
ARROW_PARQUET=ON \
ARROW_S3=ON \
ARROW_S3_MODULE=ON \
```
kou
left a comment
+1
Co-authored-by: Sutou Kouhei <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 8b56d7e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

Thanks for your work here @bkietz !
Rationale for this change
Moving S3 to a separate module will allow separately building the module for use by builds of Arrow which did not include S3 functionality.
What changes are included in this PR?
- the factory for S3 filesystems is added to the registry
- an internal function is added which unregisters factories for testing purposes
Are these changes tested?
A test for just the modularization is added which unregisters built-in factories. This way we're definitely testing loading new factories from module libraries.
Are there any user-facing changes?
No
GitHub Issue: apache#40343
Lead-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>