@karuppayya (Contributor)

What changes were proposed in this pull request?

This change (design doc) adds support for using remote storage for shuffle data.

The primary goal is to enhance the elasticity and resilience of Spark workloads, leading to substantial cost optimization opportunities.

This is a PoC to elicit feedback from the community.

Why are the changes needed?

This change decouples storage from compute, thereby helping to minimize shuffle failures and enabling better scaling of the cluster.

Does this PR introduce any user-facing change?

This change adds three configs to enable the feature.

Remote storage location for shuffle data:
spark.shuffle.remote.storage.path=<remote storage path>

Config that determines whether the feature is used:
spark.sql.shuffle.consolidation.enabled=true|false

Shuffle plugin to use when the feature is enabled (this needs to be configured manually for now, but we could switch to it automatically when the feature is enabled; TBD):
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.sort.remote.HybridShuffleDataIO
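Taken together, the three configs above might be set in `spark-defaults.conf` as sketched below. The `s3a://` path is a hypothetical example of a remote storage location, not something prescribed by this PR; only the property names and the plugin class come from the description above:

```
# Hypothetical example: point shuffle storage at a remote path (here, an S3 bucket)
spark.shuffle.remote.storage.path      s3a://my-bucket/spark-shuffle

# Turn the feature on
spark.sql.shuffle.consolidation.enabled  true

# Use the remote-capable shuffle plugin from this PR (manual for now, see TBD above)
spark.shuffle.sort.io.plugin.class     org.apache.spark.shuffle.sort.remote.HybridShuffleDataIO
```

The same settings could equally be passed per job via `--conf` flags to `spark-submit`.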

How was this patch tested?

Manual testing. Unit tests to be added.
Trying to get feedback from the community before writing elaborate tests.

Was this patch authored or co-authored using generative AI tooling?

No
