[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035

@Tom-Newton

Describe the enhancement requested

Optimisation to #37511
Child of #18014

When reading from Azure Blob Storage, the bandwidth we get per connection depends heavily on the latency to the filesystem. To achieve good bandwidth at high latency, far greater concurrency is needed. This is relevant, for example, when reading from blob storage in a different region from your compute.
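The relationship between latency and concurrency can be illustrated with a back-of-envelope model. This is only a sketch under simplifying assumptions (each request pays one fixed round trip before streaming its bytes at a constant per-connection rate, and connections do not contend with each other). The function name and parameters are hypothetical and are not part of Arrow or the Azure SDK.

```cpp
#include <cassert>
#include <cstdint>

// Rough model, not a measurement: every request waits `latency_sec` and then
// streams `chunk_bytes` at `per_connection_bytes_per_sec`. With `concurrency`
// such requests in flight, the aggregate rate scales linearly in this model.
double EstimatedThroughputBytesPerSec(double latency_sec,
                                      double per_connection_bytes_per_sec,
                                      std::int64_t chunk_bytes,
                                      int concurrency) {
  assert(latency_sec >= 0 && per_connection_bytes_per_sec > 0);
  assert(chunk_bytes > 0 && concurrency > 0);
  const double transfer_sec =
      static_cast<double>(chunk_bytes) / per_connection_bytes_per_sec;
  const double request_sec = latency_sec + transfer_sec;
  return concurrency * static_cast<double>(chunk_bytes) / request_sec;
}
```

In this model, as latency comes to dominate the transfer time, a single connection delivers only a fraction of its nominal rate, and the remaining lever is the number of requests in flight.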

As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:

  1. Arrow makes concurrent calls to ReadAt, one for each column and row group combination. This gives at most one concurrent connection per column and row group combination, so for small Parquet files the concurrency may be lower than we would like.
  2. Within ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. The purpose of this issue is to let the user configure the options that control this parallelism.
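To make the second level concrete, here is a simplified sketch of how chunk size and concurrency settings could shape a single ReadAt. It assumes, based on my reading of the linked DownloadTo, that the SDK fetches an initial chunk and then splits the remainder into fixed-size chunks downloaded by a bounded number of workers. The `ParallelTransferOptions` struct, its field names, its default values, and both helper functions are hypothetical and exist only for illustration. The real settings, as far as I can tell, live on the SDK's `DownloadBlobToOptions`, and the SDK headers should be checked for actual names and defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the options this issue proposes to expose.
// Default values are illustrative only, not the SDK's.
struct ParallelTransferOptions {
  std::int64_t initial_chunk_size = 256LL * 1024 * 1024;
  std::int64_t chunk_size = 4LL * 1024 * 1024;
  std::int32_t concurrency = 5;
};

struct ByteRange {
  std::int64_t offset;
  std::int64_t length;
};

// Split [offset, offset + length) into one initial chunk followed by
// fixed-size chunks; the last chunk may be shorter.
std::vector<ByteRange> PlanChunks(std::int64_t offset, std::int64_t length,
                                  const ParallelTransferOptions& options) {
  assert(offset >= 0 && length >= 0);
  assert(options.initial_chunk_size > 0 && options.chunk_size > 0);
  std::vector<ByteRange> chunks;
  if (length == 0) return chunks;
  const std::int64_t first = std::min(length, options.initial_chunk_size);
  chunks.push_back({offset, first});
  for (std::int64_t done = first; done < length; done += options.chunk_size) {
    chunks.push_back(
        {offset + done, std::min(options.chunk_size, length - done)});
  }
  return chunks;
}

// Upper bound on simultaneous requests for one read in this model: the
// initial chunk is fetched alone, then the rest share `concurrency` workers.
std::int64_t MaxParallelRequests(const std::vector<ByteRange>& chunks,
                                 const ParallelTransferOptions& options) {
  assert(options.concurrency > 0);
  if (chunks.empty()) return 0;
  const std::int64_t remaining = static_cast<std::int64_t>(chunks.size()) - 1;
  return std::max<std::int64_t>(
      1, std::min<std::int64_t>(options.concurrency, remaining));
}
```

Under this model, a read smaller than the initial chunk size is served by a single request no matter how high the concurrency is set, which is one reason a user reading over a high-latency link might want to tune all three values.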

Component(s)

C++
