
Conversation

@bkietz
Member

@bkietz bkietz commented May 6, 2024

Rationale for this change

Moving S3 to a separate module will allow building the module separately, for use by builds of Arrow which did not include S3 functionality.

What changes are included in this PR?

  • the factory for S3 filesystems is added to the registry
  • an internal function is added which unregisters functions for testing purposes

Are these changes tested?

A test for just the modularization is added which unregisters built-in factories. This way we're definitely testing that new factories are loaded from module libraries.

Are there any user-facing changes?

No

@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch from 589e7d7 to 9cc73fc on May 16, 2024 18:17
@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch 2 times, most recently from 7c7a277 to 970eaa4 on May 23, 2024 21:25
@bkietz bkietz marked this pull request as ready for review May 23, 2024 21:25
@bkietz bkietz requested a review from wgtmac as a code owner May 23, 2024 21:25
@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch from 970eaa4 to 976a912 on September 13, 2024 22:04
@bkietz bkietz requested a review from felipecrv September 13, 2024 22:10
@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch from 976a912 to 39e2e0e on September 17, 2024 16:32
Member

Why do we need this?

Member Author

This is used in GetOrcMajorVersion(), so per IWYU it should be #included here. I found it because compilation failed at one point (compilation can spuriously succeed when standard headers include more definitions than the standard guarantees; for example, inclusion of only <string> or <iostream> might fully define std::stringstream rather than only forward declaring it).

@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch 2 times, most recently from f8f8f2c to 130be67 on September 18, 2024 01:30
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 18, 2024
Contributor

Do we remove the linking of the main arrow library against AWSSDK_LINK_LIBRARIES when ARROW_S3_MODULE is ON?

Member Author

We don't want to require the main arrow library to load the module. After the module can reproduce the full functionality of the S3FileSystem (S3ProxyOptions, for example) we can deprecate building s3 in the main library. That's not in scope for now, though

Member

Did you have any idea on how to make the module expose the missing functionality too? I've been playing around with your PR, exposing LoadFileSystemFactories to pyarrow and running a small test where I load the s3 module from pyarrow using a libarrow built without S3. I can do it, but I have some missing symbols, which I can get if I reload with ctypes.CDLL(libarrow_s3fs_path, mode=ctypes.RTLD_GLOBAL).
Basically this test works using libarrow without S3:

def test_filesystem_from_uri_s3(s3_server):
    # Load libarrow_s3fs.so
    libarrow_s3fs_path = '/home/raulcd/code/libarrow_s3fs.so'
    FileSystem.load_file_system(libarrow_s3fs_path)
    import ctypes
    lib = ctypes.CDLL(libarrow_s3fs_path, mode=ctypes.RTLD_GLOBAL)
    assert lib is not None

    host, port, access_key, secret_key = s3_server['connection']

    uri = "s3://{}:{}@mybucket/foo/bar?scheme=http&endpoint_override={}:{}"\
          "&allow_bucket_creation=True" \
          .format(access_key, secret_key, host, port)
    fs, path = FileSystem.from_uri(uri)
    assert path == "mybucket/foo/bar"
    fs.create_dir(path)
    [info] = fs.get_file_info([path])
    assert info.path == path
    assert info.type == FileType.Directory

I am just trying to understand if you had an idea around the next steps.

Member Author

Did you have any idea on how to make the module expose the missing functionality too?

The filesystem module supports only construction of filesystems from URI strings (and thereafter we have access only to methods on the FileSystem base interface), so a brief enumeration of the functionality missing relative to pyarrow.fs.S3FileSystem is:

  • constructing the filesystem with proxy_options
  • retrieving the AWS region a filesystem is connected to
  • providing a retry_strategy to use when communicating with S3

It is possible to support all of these through URI strings alone, with some caveats:

  • encoding proxy options in the s3:// URI scheme
    • since proxy options already have a URI representation, they could be added as a single URI parameter like s3://u:p@mybucket/foo/bar?proxy_options=http%3A//proxy_u%3Aproxy_p@proxy_host%3A999
    • alternatively, independent parameters could be added for proxy_scheme, proxy_username, etc.
  • AWS region is a URI parameter, so if we need to retrieve the region from a FileSystem we can convert it to a URI and read that parameter.
    • Region is resolved when the S3FileSystem is constructed, so it won't always be a copy of the region parameter in the original URI
    • Python has not bound FileSystem::MakeUri yet, but after it does access to the s3 region could be written like
      def get_s3_region(fs: FileSystem) -> str | None:
          if fs.type_name != "s3":
              return None

          from urllib.parse import urlparse, parse_qs
          params = parse_qs(urlparse(fs.to_uri()).query)
          # region may be absent from the URI, so avoid a KeyError
          return params.get("region", [None])[0]

  • S3RetryStrategy is currently available in C++ as a fully user-extensible interface, which is of course not specifiable in a URI parameter. However, in Python at least, this is only exposed as a choice between the standard and default strategies plus a maximum number of retries. This more limited enumeration of retry strategies can be specified in a parameter, like s3://u:p@mybucket/foo/bar?retry_strategy=standard/3
    • Full configuration of the retry strategy interface seems like a feature which might be little used, but a survey on the ML would be a good idea. If no significant use cases arise, there can be a deprecation cycle for the public S3RetryStrategy
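The parameter encodings sketched above can be exercised with the standard library alone. A minimal sketch: the proxy_options and retry_strategy parameter names follow the examples in this comment and are not an implemented Arrow API.

```python
from urllib.parse import urlencode, urlparse, parse_qs, quote

# Build an s3:// URI carrying proxy and retry settings as query parameters.
# These parameter names mirror the proposal above; they are hypothetical.
base = "s3://user:secret@mybucket/foo/bar"
params = {
    "proxy_options": "http://proxy_u:proxy_p@proxy_host:999",
    "retry_strategy": "standard/3",
}
uri = base + "?" + urlencode(params, quote_via=quote)

# A consumer (e.g. a filesystem factory) would parse them back out:
query = parse_qs(urlparse(uri).query)
assert query["proxy_options"] == ["http://proxy_u:proxy_p@proxy_host:999"]
assert query["retry_strategy"] == ["standard/3"]
```

Percent-encoding the embedded proxy URI (as urlencode does here) keeps its ':' and '@' characters from being confused with the outer URI's own delimiters.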

Once the above changes are made, the full functionality of pyarrow.fs.S3FileSystem can be recovered using just libarrow_s3fs. At that point, the option to build libarrow directly linked to the AWS SDK will be redundant and can be removed.

Member Author

but I have some missing symbols

I don't know why that would be; LoadFileSystemFactories loads symbols locally but the only symbols in libarrow_s3fs would be

  • arrow_filesystem_get_registry
  • symbols from the AWS libs
  • symbols from libarrow

... none of which I'd expect to be needed by subsequently loaded shared objects. What symbols are missing? If the AWS libs are "subsequently loaded" then I guess LoadFileSystemFactories will need to load globally as in your python snippet

Member Author

Of course, another way to handle extended features of a particular FileSystem is to add extra functions to its module. For example, suppose that reading the resolved S3 region from a URI parameter as I described above did not seem natural. Instead, we could add to s3fs_module.cc

extern "C" {
ARROW_FORCE_EXPORT char* arrow_filesystem_get_s3_region(const void* fs);
}

... then in python that function could be retrieved with CDLL.__getattr__.
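A runnable sketch of that CDLL attribute-access pattern follows. Since arrow_filesystem_get_s3_region is only hypothetical, the demonstration calls libc's strlen through the current process handle as a stand-in:

```python
import ctypes

# CDLL(None) gives a handle to the current process (POSIX), which re-exports
# libc; strlen stands in for the hypothetical Arrow export here.
proc = ctypes.CDLL(None)
proc.strlen.restype = ctypes.c_size_t
proc.strlen.argtypes = [ctypes.c_char_p]
assert proc.strlen(b"s3") == 2

# With a real module the pattern would look the same (hypothetical names):
#   lib = ctypes.CDLL("/path/to/libarrow_s3fs.so")
#   lib.arrow_filesystem_get_s3_region.restype = ctypes.c_void_p  # keep the
#   ptr = lib.arrow_filesystem_get_s3_region(fs_handle)           # raw pointer
#   region = ctypes.string_at(ptr).decode() if ptr else None      # so it can
#                                                                 # be freed
```

Note that declaring the hypothetical function's restype as c_void_p rather than c_char_p matters for the "caller frees" contract: ctypes converts a c_char_p result to bytes and discards the pointer, which would leak the allocation.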

I recommend avoiding this extra-C-functions approach if possible, since it introduces a de facto ABI that would be a well-hidden sharp edge in the library. For resolving the s3 region this could be as simple as "the caller is responsible for freeing the returned string if it's not null". However, if we consider the more complex example of a fully user-programmable retry strategy, extra-C-functions would involve

  • negotiating the lifetime (both allocation/deallocation and construction/destruction) of retry strategy objects
  • documenting the retry strategy interface as ABI stable so that no developers modify it (which would break an older libarrow_s3fs loaded by newer libarrow)
  • probably asserting that the major version of libarrow_${any}fs matches the major version of libarrow which is loading it
  • possibly also rewriting the retry strategy interface as a bundle of function pointers as we do with other ABI objects

Member Author

@raulcd

After this merges, I will open a draft PR filling out the missing points of S3 support I described above:

  • add proxy parameters to the S3 URL format. Proxy parameters will be namespaced like proxy.username
  • expose MakeUri in python and use it to extract S3 region
  • add a retry strategy parameter to the S3 URL format
  • deprecate the retry strategy interface

I think having a concrete patch to discuss will help focus discussion here and on the ML

Comment on lines +3043 to +3095
Contributor

Should these be in URIs? Has this always been the case?

Member Author

This has been in place since S3 URIs were supported, #6403

The URI is intended to be a fully self contained initializer for a filesystem, so if the filesystem requires secrets for initialization then the URI must contain them
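Concretely, the secret is trivially recoverable from such a URI's userinfo component; a runnable illustration with placeholder credentials:

```python
from urllib.parse import urlparse

# Placeholder credentials: anyone holding this URI holds the secret.
uri = "s3://AKIDEXAMPLE:s3cr3t@mybucket/foo/bar?region=us-east-1"
parts = urlparse(uri)
assert parts.username == "AKIDEXAMPLE"
assert parts.password == "s3cr3t"
assert parts.hostname == "mybucket"
```

This is what makes the URI a self-contained initializer, and equally what makes it sensitive to leakage through logs.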

Member

@assignUser assignUser Sep 26, 2024

Is this still recommended practice? It seems Azure recommends not doing this to prevent accidental exposure of credentials in logs etc. https://lists.apache.org/thread/mhwr2lvtxvjcqos12k7hr4cqkdofrxxo

I don't know anything about S3 auth though so 🤷 (in any case probably something for a follow up)

Member

@bkietz Could you start a discussion for this on dev@ like the Azure one just in case?
https://lists.apache.org/thread/mhwr2lvtxvjcqos12k7hr4cqkdofrxxo

If we should not do this, we can remove this logic as a follow-up task.

Member Author

@bkietz bkietz Apr 7, 2025

I can start an ML discussion, but: in arrow::fs, URIs are intended to serve as complete specifications of a filesystem. If a filesystem cannot be constructed without a secret, then either 1. the URI contains that secret or 2. that filesystem cannot be constructed from a URI

  1. In the current design, filesystems simply are secrets, transitively; anyone who has the URI has access to the filesystem to which it refers.
    • This should definitely be documented more rigorously
    • it'd probably be a good idea for MakeUri to return a SecureString too, when that's merged
  2. If URIs-which-are-secrets is unacceptable, we would need to extend the FileSystemFactory interface to support out-of-band secrets somehow.
    • For example, we could provide a SecretRegistry to which new secrets may be added like GetSecretRegistry()->AddSecret({.key = "my-s3-secret-key", .secret = "sw0rdf1sh"}); then reference those from URIs like s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}
    • To me this indirection seems error prone and the security increment seems minimal
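As a toy, runnable illustration of that indirection (SecretRegistry, the placeholder syntax, and every name here are hypothetical, not Arrow API):

```python
# Hypothetical secret registry: names travel in the URI, values stay out of band.
_secrets = {}

def add_secret(key, secret):
    _secrets[key] = secret

def resolve(uri):
    # Replace {name} placeholders with registered secret values.
    for key, secret in _secrets.items():
        uri = uri.replace("{" + key + "}", secret)
    return uri

add_secret("my-s3-key", "AKIDEXAMPLE")
add_secret("my-s3-secret-key", "sw0rdf1sh")
resolved = resolve("s3://{my-s3-key}:{my-s3-secret-key}@mybucket/data")
assert resolved == "s3://AKIDEXAMPLE:sw0rdf1sh@mybucket/data"
```

Note the resolved URI is exactly as sensitive as before; the indirection only keeps the secret out of the URI string that gets passed around, which is the limited security increment the comment above refers to.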

Member

Wow! I feel that it's a good summary for the discussion. :-)
Depending on the result of the discussion, we may need to change the Azure filesystem implementation too.


@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Sep 20, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Sep 26, 2024
@felipecrv
Contributor

@pitrou do you want to merge this one?

@raulcd
Member

raulcd commented Dec 20, 2024

@pitrou is this something we want to rebase / revive? Would be great to continue with the great work that @bkietz started

@pitrou
Member

pitrou commented Dec 20, 2024

Yes, we probably want to revive this at some point.

Member

@raulcd raulcd left a comment

This is great and I would love to push this to be merged and continue working on this. To be honest I think this is mergeable as is. We can continue working from it. I just have a couple of minor questions. Maybe we could also rebase.
@bkietz I can rebase (and maybe enable this on CI) if you don't plan to continue working on this.


@bkietz bkietz force-pushed the 40343-s3fs-into-registry branch from d2d6f03 to f555ae1 on April 4, 2025 21:42
@bkietz bkietz requested review from jonkeane and kou as code owners April 4, 2025 21:42
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 4, 2025
ARROW_ORC=ON \
ARROW_PARQUET=ON \
ARROW_S3=ON \
ARROW_S3_MODULE=ON \
Member Author

@raulcd @pitrou I have rebased and enabled the S3 module build and test here on conda-cpp. Unless there are further comments, I'll merge on Monday

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 4, 2025
Member

@kou kou left a comment

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Apr 5, 2025
Co-authored-by: Sutou Kouhei <[email protected]>
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting merge Awaiting merge awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 5, 2025
@bkietz bkietz merged commit 8b56d7e into apache:main Apr 7, 2025
39 of 40 checks passed
@bkietz bkietz removed the awaiting changes Awaiting changes label Apr 7, 2025
@bkietz bkietz deleted the 40343-s3fs-into-registry branch April 7, 2025 22:24
@github-actions github-actions bot added the awaiting changes Awaiting changes label Apr 8, 2025
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 8b56d7e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

@raulcd
Member

raulcd commented Apr 8, 2025

Thanks for your work here @bkietz !

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025

* GitHub Issue: apache#40343

Lead-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
