[#1608][part-5] feat(spark3): always use the available assignment #1652
Conversation
…and load balance for huge partition
Test Results: 2 391 files (+42), 2 391 suites (+42), 4h 41m 32s ⏱️ (+29m 39s). Results for commit f44f6a4; comparison against base commit 60fce8e. This pull request removes 5 and adds 11 tests (renamed tests count towards both). ♻️ This comment has been updated with latest results.
import org.apache.uniffle.common.exception.RssException;

/** This class is to wrap the shuffleHandleInfo to speed up the partitionAssignment getting. */
public class ShuffleHandleInfoWrapper {
Why not add this caching to ShuffleHandleInfo, with ShuffleHandleInfo.updateAssignmentPlan fetching and storing the latestAssignment, and ShuffleHandleInfo.getPartitionAssignment(taskAttemptId) providing the assignment from that map?
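For illustration, a sketch of that proposed shape — updateAssignmentPlan and getPartitionAssignment are the reviewer's suggested additions, not existing Uniffle API, and the types are simplified stand-ins:

import java.util.List;
import java.util.Map;

// Sketch of the suggested caching inside ShuffleHandleInfo; the names and
// signatures mirror the reviewer's proposal, not actual Uniffle code.
abstract class CachingShuffleHandleInfoSketch {
  private Map<Integer, List<String>> latestAssignment; // partitionId -> servers

  // Fetch the whole latest plan once and store it...
  public void updateAssignmentPlan(long taskAttemptId) {
    this.latestAssignment = fetchLatestAssignmentPlan(taskAttemptId);
  }

  // ...so that each per-partition lookup is a plain map access.
  public Map<Integer, List<String>> getPartitionAssignment(long taskAttemptId) {
    return latestAssignment;
  }

  protected abstract Map<Integer, List<String>> fetchLatestAssignmentPlan(long taskAttemptId);
}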
ShuffleHandleInfo always holds the latest version in the Spark driver's shuffleManager. But the shuffleHandleInfo held by a shuffle writer is only partially up to date, since it is refreshed through the gRPC handle.
Using your approach could cause confusion, because the same object would serve two different usages; that doesn't look clear.
Feel free to discuss more.
You are saying you do not want the ShuffleHandleInfo held by the shuffle writer to change?
Not exactly. I'd rather the latest-assignment cache not be maintained inside the shuffleHandleInfo. The shuffleHandleInfo is used by the shuffle writer, but it is transferred to the writer over gRPC (and replaced by the reassign RPC every time), so a cache inside it cannot be shared across all tasks. Using an independent handle wrapper to hold the cache seems clearer.
Calling ShuffleHandleInfoWrapper(taskAttemptId, shuffleHandleInfo).retrievePartitionAssignment(partitionId) could be replaced with shuffleHandleInfo.getLatestAssignmentPlan(taskAttemptId).get(partitionId). To avoid fetching the whole plan for each partition, the wrapper caches it. That appears to be the only purpose of this class, as its class comment indicates:
/** This class is to wrap the shuffleHandleInfo to speed up the partitionAssignment getting. */
Why can't ShuffleHandleInfo cache the partition assignment? Do you need ShuffleHandleInfo to be immutable?
Why can't ShuffleHandleInfo cache the partition assignment?
Yes, caching there would also be OK; let me think twice.
Do you need ShuffleHandleInfo to be immutable?
No need. The shuffleHandleInfo will be updated when a reassign occurs, but that happens on the driver side, which then sends the latest handleInfo back to the executor task side.
…ter/ShuffleHandleInfoWrapper.java Co-authored-by: Enrico Minack <[email protected]>
If one server becomes a faulty server, all tasks will change the assignment, won't they? Why do we need to record every task for a new assignment?
I want to clarify that receivingFailureServer should be scoped to partition block data rather than to tasks. Sometimes a server reaches a high watermark with too many requests, which affects the partitioned data it is receiving at that time; those partitions should be reassigned to another server. If this doesn't happen for other partitions, their assignment will not change.
I don't catch your thought about task -> assignment.
I got your point. You record just one reassignment, but you rebalance according to hash or range. It's OK that we store one assignment, but we should consider the two class names. Could the shuffle server return a high-load error code when it is under too heavy load? Is it really a failure when we just return a high-load error code? Maybe we don't need to change this class name. Should we have a strategy class to handle the difference between faulty servers and high-load servers?
It looks like there was agreement on the regular partition reassignment, that's good. Let's extend the topic to the huge partition or high-load server that you defined.
Yes, this is OK. Actually, this could be implemented on the shuffle-server side, and everything needed for this high-load reassignment is already supported on the client side.
Yes, this could also be implemented on the TaskAssignment side; it was actually supported in the previous implementation, but I removed that part for better understanding. Anyway, I only do the regular reassignment here and ensure extensibility for future development.
This could be a pluggable strategy if needed in the future.
I want a strategy class to help us understand the class.
Could you help directly review this class and leave comments and suggestions?
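To make the idea concrete, such a pluggable strategy could be sketched like this; ReassignStrategy, the Reason values, and the method shape are all assumptions for discussion, not code from this PR:

import java.util.List;

// Hypothetical sketch of the strategy class discussed above; every name
// here is an assumption for discussion, not the actual Uniffle API.
public interface ReassignStrategy {
  enum Reason {
    FAULTY,    // server is broken: move all of its partitions elsewhere
    HIGH_LOAD  // server is temporarily overloaded: move only the affected
               // partition block data, leaving other partitions untouched
  }

  // Decide which of the candidate partitions should be reassigned away
  // from the given server for the given reason.
  List<Integer> selectPartitionsToReassign(
      String serverId, Reason reason, List<Integer> candidatePartitions);
}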
jerqi left a comment:
LGTM. Do you need to update the description of this pull request?
Updated. Please take a look @EnricoMi. If you have no problem with this, I will merge it.
import org.apache.uniffle.common.RemoteStorageInfo;

public abstract class ShuffleHandleInfoBase implements ShuffleHandleInfo, Serializable {
ShuffleHandleInfoBase -> BaseShuffleHandleInfo?
I'd prefer keeping the ShuffleHandleInfo prefix so the related classes are grouped near each other.
A Base prefix is the more common naming for a base class.
I like BaseShuffleHandleInfo, but a class name like {Interface}Base is common practice for a base implementation of an interface {Interface}.
Let's keep this name.
import org.apache.uniffle.proto.RssProtos;

/** This class holds the dynamic partition assignment for partition reassign mechanism. */
public class MutableShuffleHandleInfo extends ShuffleHandleInfoBase {
A class should not be named after its usage, but after what it implements, as it has no knowledge about how it is being used, and it does not care, it provides only what it implements.
This handle info is used for fault tolerance, but there is no fault tolerance built into this class. It uses per-replica server infos to implement getAvailablePartitionServersForWriter() and getAllPartitionServersForReader().
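As a rough illustration of that per-replica bookkeeping — field names and types below are simplified stand-ins, not the exact Uniffle implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch: writers see only the currently active server per replica,
// while readers see every server that ever held data for the partition.
class PerReplicaAssignmentSketch {
  // partitionId -> replicaIndex -> servers in historical order (latest last)
  private final Map<Integer, Map<Integer, List<String>>> servers = new HashMap<>();

  Map<Integer, List<String>> availablePartitionServersForWriter() {
    Map<Integer, List<String>> out = new HashMap<>();
    servers.forEach((partition, replicas) -> {
      List<String> active = new ArrayList<>();
      // Only the latest server of each replica accepts new writes.
      replicas.values().forEach(history -> active.add(history.get(history.size() - 1)));
      out.put(partition, active);
    });
    return out;
  }

  Map<Integer, List<String>> allPartitionServersForReader() {
    Map<Integer, List<String>> out = new HashMap<>();
    servers.forEach((partition, replicas) -> {
      // Readers must consult every server that ever received blocks.
      List<String> all = new ArrayList<>();
      replicas.values().forEach(all::addAll);
      out.put(partition, all);
    });
    return out;
  }
}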
private boolean mutable = false;

public TaskAttemptAssignment(long taskAttemptId, ShuffleHandleInfo shuffleHandleInfo) {
  this.assignment = shuffleHandleInfo.getAvailablePartitionServersForWriter();
after removing mutable completely (see below), this should be changed to:

-    this.assignment = shuffleHandleInfo.getAvailablePartitionServersForWriter();
+    this.update(shuffleHandleInfo);
Gentle ping @EnricoMi
Merged. Thanks for your review @qqqttt123 @jerqi @dingshun3016 @xumanbu @EnricoMi. Feel free to discuss more if you have any suggestions @EnricoMi.
What changes were proposed in this pull request?

Reassign whole process

Always using the latest assignment

To achieve always using the latest assignment, I introduce the TaskAttemptAssignment to get the latest assignment for the current task. The creation of AddBlockEvent also applies the latest assignment from TaskAttemptAssignment, which is updated by the reassignOnBlockSendFailure RPC. That means the original reassign RPC response is refactored and replaced by the whole latest shuffleHandleInfo (see the sketch after this description).

Why are the changes needed?

This PR is a subtask of #1608.

Leveraging #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters server failure or unhealthiness. But this is not good enough: the faulty server state is not shared with tasks that have not yet started, or with later AddBlockEvents.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Unit and integration tests.

Integration tests as follows:
PartitionBlockDataReassignBasicTest validates the basic reassign mechanism.
PartitionBlockDataReassignMultiTimesTest tests the partition reassign mechanism with multiple retries.
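A minimal sketch of the flow described above: TaskAttemptAssignment, AddBlockEvent, and reassignOnBlockSendFailure are the PR's names, but every type and signature below is a simplified assumption, not the actual Uniffle implementation.

import java.util.List;
import java.util.Map;

// Minimal sketch of the reassign flow under assumed signatures.
class ReassignFlowSketch {
  interface ShuffleHandleInfo {
    Map<Integer, List<String>> getAvailablePartitionServersForWriter();
  }

  interface ShuffleManagerClient {
    // Refactored RPC: returns the whole latest handle info instead of a delta.
    ShuffleHandleInfo reassignOnBlockSendFailure(int shuffleId, long taskAttemptId);
  }

  static class TaskAttemptAssignment {
    private Map<Integer, List<String>> assignment;

    TaskAttemptAssignment(ShuffleHandleInfo handle) {
      update(handle);
    }

    // Every AddBlockEvent is built from this view, so new events always
    // target the latest servers even after earlier sends have failed.
    List<String> retrieve(int partitionId) {
      return assignment.get(partitionId);
    }

    void update(ShuffleHandleInfo handle) {
      this.assignment = handle.getAvailablePartitionServersForWriter();
    }
  }

  void onBlockSendFailure(
      ShuffleManagerClient client, TaskAttemptAssignment taskAssignment,
      int shuffleId, long taskAttemptId) {
    // The driver computes the reassignment and sends back the whole latest handle...
    ShuffleHandleInfo latest = client.reassignOnBlockSendFailure(shuffleId, taskAttemptId);
    // ...from which the task-local assignment view is refreshed.
    taskAssignment.update(latest);
  }
}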