Open
Labels
Component: C++, Status: stale-warning, Type: enhancement
Description
Describe the enhancement requested
Optimisation to #37511
Child of #18014
When reading from Azure Blob Storage, the bandwidth we get per connection depends heavily on the latency to the filesystem. Achieving good bandwidth at high latency requires far greater concurrency. This is relevant, for example, when reading from blob storage in a different region from your compute.
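To make the latency argument concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model that is not taken from any measurement: each connection issues one chunk request at a time, and each request costs one round trip plus the transfer time at the connection's peak rate. The function names and parameters are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Simplified model (an assumption, not a measurement): a connection fetches
// one chunk per request, and each request costs one round trip plus the time
// to transfer the chunk at the connection's peak rate.
double PerConnectionBytesPerSec(double chunk_bytes, double rtt_sec,
                                double peak_bytes_per_sec) {
  assert(chunk_bytes > 0 && rtt_sec >= 0 && peak_bytes_per_sec > 0);
  return chunk_bytes / (rtt_sec + chunk_bytes / peak_bytes_per_sec);
}

// Number of concurrent connections needed to reach a target aggregate
// throughput under the model above.
int ConnectionsNeeded(double target_bytes_per_sec, double chunk_bytes,
                      double rtt_sec, double peak_bytes_per_sec) {
  assert(target_bytes_per_sec > 0);
  const double per_connection =
      PerConnectionBytesPerSec(chunk_bytes, rtt_sec, peak_bytes_per_sec);
  return static_cast<int>(std::ceil(target_bytes_per_sec / per_connection));
}
```

Under this model, with 4 MB chunks and a 100 MB/s peak rate per connection, raising the round-trip time from 10 ms to 160 ms raises the number of connections needed for 90 MB/s from 2 to 5. Real numbers will differ, but the direction of the effect is the point.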
As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native `AzureFileSystem`:
- Arrow will make concurrent calls to `ReadAt` for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small Parquet files this may be less than we would like.
- Within `ReadAt`, the `AzureFileSystem` calls `BlobClient::DownloadTo`, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516.

The purpose of this issue is to make the options for this parallelism configurable by the user.
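As a sketch of what the configurable knobs might look like, the snippet below models how a single `ReadAt` could be split into requests. Everything here is an assumption for illustration: `DownloadTuning`, `DownloadPlan`, and `PlanRead` are hypothetical names, not existing Arrow or Azure SDK APIs. The three fields are meant to mirror what I recall of the SDK's `DownloadBlobToOptions::TransferOptions` (`InitialChunkSize`, `ChunkSize`, `Concurrency`), which should be checked against the SDK headers, and the splitting logic is a simplification of the linked `DownloadTo` code rather than a copy of it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical user-facing knobs. The field names are chosen to resemble the
// SDK's transfer options; this struct is not an existing Arrow API.
struct DownloadTuning {
  int64_t initial_chunk_size;  // bytes fetched by the first request
  int64_t chunk_size;          // bytes per follow-up request
  int32_t concurrency;         // maximum follow-up requests in flight
};

struct DownloadPlan {
  int64_t first_request_bytes;
  int64_t follow_up_requests;
  int32_t parallel_requests;
};

// Simplified model: one initial request, then the remainder is split into
// fixed-size chunks that are fetched with bounded parallelism.
DownloadPlan PlanRead(int64_t nbytes, const DownloadTuning& tuning) {
  assert(nbytes >= 0 && tuning.initial_chunk_size > 0 &&
         tuning.chunk_size > 0 && tuning.concurrency > 0);
  DownloadPlan plan{};
  plan.first_request_bytes = std::min(nbytes, tuning.initial_chunk_size);
  const int64_t remaining = nbytes - plan.first_request_bytes;
  // Ceiling division: a partial trailing chunk still needs its own request.
  plan.follow_up_requests =
      (remaining + tuning.chunk_size - 1) / tuning.chunk_size;
  plan.parallel_requests = static_cast<int32_t>(
      std::min<int64_t>(plan.follow_up_requests, tuning.concurrency));
  return plan;
}
```

In this model, a 64 MiB read with a 256 MiB initial chunk is served by a single request with no parallelism at all, whereas an initial chunk and chunk size of 4 MiB with a concurrency of 16 yields 15 follow-up requests that can all be in flight together. That difference is why exposing these values to the user seems worthwhile for high-latency reads.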
Component(s)
C++