[#1608][part-5] feat(spark3): always use the available assignment #1652

zuston · 2024-04-17T02:28:03Z

What changes were proposed in this pull request?

Make the write client always use the latest available assignment for subsequent writes after a block reassignment happens.
Support multiple retries for partition reassignment.
Limit the maximum number of reassigned servers per partition.
Refactor the reassign RPC.
Rename faultyServer -> receivingFailureServer.
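
To illustrate the "limit the max reassign server num of one partition" item, here is a minimal sketch, not the code in this PR. The class name PartitionReassignLimiter, its methods, and the string server ids are all hypothetical; the only things taken from this PR are the idea of a per-partition cap and the ConcurrentHashMap plus set combination mentioned in one of the commits.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: caps how many replacement servers one partition may consume. */
final class PartitionReassignLimiter {
  // Assumed to come from a configuration option; the real option name is not shown here.
  private final int maxReassignServers;
  // partitionId -> replacement servers already handed out for that partition
  private final Map<Integer, Set<String>> replacements = new ConcurrentHashMap<>();

  PartitionReassignLimiter(int maxReassignServers) {
    this.maxReassignServers = maxReassignServers;
  }

  /** Records a replacement server; returns false once the partition has reached its limit. */
  boolean tryReassign(int partitionId, String replacementServerId) {
    Set<String> used = replacements.computeIfAbsent(partitionId, k -> new LinkedHashSet<>());
    synchronized (used) {
      if (used.contains(replacementServerId)) {
        return true; // already counted, nothing new to record
      }
      if (used.size() >= maxReassignServers) {
        return false; // limit reached, caller should fail instead of reassigning again
      }
      used.add(replacementServerId);
      return true;
    }
  }

  /** Number of distinct replacement servers recorded for the partition so far. */
  int reassignedCount(int partitionId) {
    Set<String> used = replacements.get(partitionId);
    if (used == null) {
      return 0;
    }
    synchronized (used) {
      return used.size();
    }
  }
}
```

With a limit of 2, a third distinct replacement for the same partition is rejected, while other partitions keep their own budget.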

Reassign whole process

Always using the latest assignment

To achieve always using the latest assignment, I introduce TaskAttemptAssignment to get the latest assignment for the current task. The creation of each AddBlockEvent also applies the latest assignment from TaskAttemptAssignment.

It is updated by the reassignOnBlockSendFailure RPC.
That means the original reassign RPC response is refactored and replaced by the whole latest shuffleHandleInfo.
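
The idea described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the class added by this PR: HandleInfoSketch stands in for ShuffleHandleInfo and models only getAvailablePartitionServersForWriter(), servers are plain strings, and the names TaskAttemptAssignmentSketch and retrieve are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the handle info; only one accessor is modelled. */
interface HandleInfoSketch {
  Map<Integer, List<String>> getAvailablePartitionServersForWriter();
}

/** Sketch of a per-task view that always serves the most recently received assignment. */
final class TaskAttemptAssignmentSketch {
  private final long taskAttemptId;
  // Replaced as a whole on every update, so readers never see a half-updated map.
  private volatile Map<Integer, List<String>> assignment = Collections.emptyMap();

  TaskAttemptAssignmentSketch(long taskAttemptId, HandleInfoSketch handleInfo) {
    this.taskAttemptId = taskAttemptId;
    update(handleInfo);
  }

  /** Called with the handle info returned by the reassign RPC; swaps in the new view. */
  void update(HandleInfoSketch handleInfo) {
    Map<Integer, List<String>> copy = new HashMap<>();
    handleInfo
        .getAvailablePartitionServersForWriter()
        .forEach((partitionId, servers) -> copy.put(partitionId, new ArrayList<>(servers)));
    this.assignment = Collections.unmodifiableMap(copy);
  }

  /** Servers that new blocks of this partition should be sent to. */
  List<String> retrieve(int partitionId) {
    List<String> servers = assignment.get(partitionId);
    if (servers == null) {
      throw new IllegalStateException("No assignment for partition " + partitionId);
    }
    return servers;
  }

  long getTaskAttemptId() {
    return taskAttemptId;
  }
}
```

Because block creation always goes through retrieve, blocks created after an update are routed to the replacement servers without the writer having to track which server failed.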

Why are the changes needed?

This PR is a subtask of #1608.

Building on #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters a failed or unhealthy server. But this is not good enough, because the faulty server state is not shared with tasks that have not started yet or with later AddBlockEvents.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Unit and integration tests.

Integration tests as follows:

PartitionBlockDataReassignBasicTest validates that the reassign mechanism works.
PartitionBlockDataReassignMultiTimesTest tests the partition reassign mechanism with multiple retries.

…and load balance for huge partition

github-actions · 2024-04-17T02:59:59Z

Test Results

2 391 files + 42 2 391 suites +42 4h 41m 32s ⏱️ + 29m 39s
925 tests + 6 924 ✅ + 7 1 💤 ±0 0 ❌ ±0
10 712 runs +114 10 698 ✅ +116 14 💤 ±0 0 ❌ ±0

Results for commit f44f6a4. ± Comparison against base commit 60fce8e.

This pull request removes 5 and adds 11 tests. Note that renamed tests count towards both.

org.apache.spark.shuffle.ShuffleHandleInfoTest ‑ testCreatePartitionReplicaTracking
org.apache.spark.shuffle.ShuffleHandleInfoTest ‑ testListAllPartitionAssignmentServers
org.apache.spark.shuffle.ShuffleHandleInfoTest ‑ testReassignment
org.apache.uniffle.shuffle.manager.ShuffleManagerServerFactoryTest ‑ testShuffleManagerServerType
org.apache.uniffle.test.PartitionBlockDataReassignTest ‑ resultCompareTest

org.apache.hadoop.mapred.SortWriteBufferTest ‑ testSortBufferIterator
org.apache.spark.shuffle.handle.MutableShuffleHandleInfoTest ‑ testCreatePartitionReplicaTracking
org.apache.spark.shuffle.handle.MutableShuffleHandleInfoTest ‑ testListAllPartitionAssignmentServers
org.apache.spark.shuffle.handle.MutableShuffleHandleInfoTest ‑ testUpdateAssignment
org.apache.spark.shuffle.writer.RssShuffleWriterTest ‑ reassignMultiTimesForOnePartitionIdTest
org.apache.spark.shuffle.writer.RssShuffleWriterTest ‑ refreshAssignmentTest
org.apache.uniffle.server.buffer.ShuffleBufferManagerTest ‑ blockSizeMetricsTest
org.apache.uniffle.shuffle.manager.ShuffleManagerServerFactoryTest ‑ testShuffleManagerServerType{ServerType}[1]
org.apache.uniffle.shuffle.manager.ShuffleManagerServerFactoryTest ‑ testShuffleManagerServerType{ServerType}[2]
org.apache.uniffle.test.PartitionBlockDataReassignBasicTest ‑ resultCompareTest
…

♻️ This comment has been updated with latest results.

EnricoMi · 2024-04-17T06:14:15Z

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/ShuffleHandleInfoWrapper.java

+import org.apache.uniffle.common.exception.RssException;
+
+/** This class is to wrap the shuffleHandleInfo to speed up the partitionAssignment getting. */
+public class ShuffleHandleInfoWrapper {


Why not add this caching to ShuffleHandleInfo? ShuffleHandleInfo.updateAssignmentPlan would fetch and store latestAssignment, and ShuffleHandleInfo.getPartitionAssignment(taskAttemptId) would provide the assignment from that Map.

ShuffleHandleInfo always holds the latest version in the Spark driver's shuffleManager. But for the shuffle writer, the shuffleHandleInfo it holds is only partially up to date; it is updated through gRPC.

If we did it your way, it might cause confusion, because the same object would have different usages. That does not look clear.

Feel free to discuss more.

You are saying you do not want the ShuffleHandleInfo held by the shuffle writer to change?

Not exactly. I would prefer that the updated latest cache not be maintained inside the shuffleHandleInfo. The cache is used by the shuffle writer, but the shuffleHandleInfo is transferred to the writer over gRPC (updated by the reassign RPC every time). So this cache is not suitable for sharing across all tasks. Maybe using an independent handle wrapper to hold the cache is clearer.

Calling ShuffleHandleInfoWrapper(taskAttemptId, shuffleHandleInfo).retrievePartitionAssignment(partitionId) could be replaced with shuffleHandleInfo.getLatestAssignmentPlan(taskAttemptId).get(partitionId). To avoid fetching the whole plan for each partition, this wrapper caches it. Looks like this is the only purpose of this class, as indicated by the comment above

/** This class is to wrap the shuffleHandleInfo to speed up the partitionAssignment getting. */

Why can't ShuffleHandleInfo cache the partition assignment? Do you need ShuffleHandleInfo to be immutable?

Why can't ShuffleHandleInfo cache the partition assignment?

Yes, the cache is also OK. Let me think it over.

Do you need ShuffleHandleInfo to be immutable?

No need. The shuffleHandleInfo will be updated when a reassignment occurs. But this happens on the driver side, which then sends the latest handleInfo back to the executor task side.

client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/ShuffleHandleInfoWrapper.java

…ter/ShuffleHandleInfoWrapper.java Co-authored-by: Enrico Minack <[email protected]>

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/ShuffleHandleInfoWrapper.java

client-spark/common/src/main/java/org/apache/spark/shuffle/ShuffleHandleInfo.java

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/WriteBufferManager.java

...spark/common/src/main/java/org/apache/uniffle/shuffle/manager/ShuffleManagerGrpcService.java

client-spark/spark3/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java

…rvers have been replacemented

client-spark/spark3/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java

client-spark/common/src/main/java/org/apache/spark/shuffle/ShuffleHandleInfo.java

internal-client/src/main/java/org/apache/uniffle/client/impl/grpc/ShuffleServerGrpcClient.java

client-spark/common/src/main/java/org/apache/spark/shuffle/RssSparkConfig.java

server/src/main/java/org/apache/uniffle/server/ShuffleServer.java

jerqi · 2024-04-29T06:32:18Z

If one server becomes a faulty server, all tasks will change the assignment, won't they? Why do we need to record every task for a new assignment?

zuston · 2024-04-29T06:51:39Z

If one server becomes a faulty server, all tasks will change the assignment, won't they? Why do we need to record every task for a new assignment?

I want to clarify that receivingFailureServer should be scoped to partition block data rather than to tasks. Sometimes a server is at a high watermark with too many requests, which affects the partitioned data sent at that time. That means this partitioned data should be reassigned to another server. If this does not happen for other partitions, their assignment will not be changed.

Why do we need to record every task for a new assignment?

I don't follow your point about task -> assignment.
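
A small sketch of what "scoped to partition block data" means in practice, under assumptions: the class PartitionScopedReassign, its reassign method, and the string server ids are hypothetical and only illustrate that partitions without a reported failure keep their servers, even when they share a server with a failing partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of partition-scoped reassignment. */
final class PartitionScopedReassign {
  private PartitionScopedReassign() {}

  /**
   * Returns a new assignment in which only the partitions that reported a receiving
   * failure swap the failing server for the replacement; all others are left untouched.
   */
  static Map<Integer, List<String>> reassign(
      Map<Integer, List<String>> current,
      Map<Integer, String> failingServerByPartition,
      String replacementServer) {
    Map<Integer, List<String>> updated = new HashMap<>();
    for (Map.Entry<Integer, List<String>> entry : current.entrySet()) {
      List<String> servers = new ArrayList<>(entry.getValue());
      String failing = failingServerByPartition.get(entry.getKey());
      if (failing != null) {
        int index = servers.indexOf(failing);
        if (index >= 0) {
          servers.set(index, replacementServer);
        }
      }
      updated.put(entry.getKey(), servers);
    }
    return updated;
  }
}
```

If partitions 0 and 1 both live on the same server and only partition 0 reports a failure, partition 1 stays where it is.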

jerqi · 2024-04-29T07:03:11Z

I got your point. You record only one reassignment, and re-balance according to hash or range. It's OK that we store one assignment. But we should reconsider two class names.

receivingFailureServer

Could we return a high load error code to the server when the shuffle server has too high load? Is it a failure when we just return a high load error code?

TaskAssignment

Maybe we don't need to change this class name. Should we have a strategy class to handle the difference between faulty servers and high load servers?

zuston · 2024-04-29T08:10:45Z

It looks like we agree on the regular partition reassignment, that's good.

Let's extend the topic to the huge partition or high-load server cases that you described.

Could we return a high load error code to the server when the shuffle server has too high load? Is it a failure when we just return a high load error code?

Yes, this is OK. Actually, this could be implemented on the shuffle-server side, and everything needed for high-load reassignment is already supported on the client side.

Maybe we don't need to change this class name. Should we have a strategy class to handle the difference between faulty servers and high load servers?

Yes, this could also be implemented on the TaskAssignment side. It was actually supported in the previous implementation, but I removed that part to make it easier to understand.

Anyway, I only do the regular reassignment here and keep it extensible for future development.

jerqi · 2024-04-29T09:49:12Z

I just feel that receivingFailureServer isn't a good name.
Could you extract a strategy class for the fault tolerance?

zuston · 2024-04-29T12:44:43Z

Could you extract a strategy class for the fault tolerance?

This could be a pluggable strategy if needed in the future.
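
For reference, a pluggable strategy of the kind discussed here might look like the sketch below. Nothing in it exists in this PR: SendFailureKind, ReassignStrategySketch, AlwaysReassign, TolerateHighLoad, and the threshold parameter are all hypothetical names used only to show how faulty servers and high-load servers could be treated differently.

```java
/** Hypothetical classification of why a block send failed; not part of this PR. */
enum SendFailureKind {
  SERVER_FAULTY,
  SERVER_HIGH_LOAD
}

/** Hypothetical extension point: decides whether a failure should trigger reassignment. */
interface ReassignStrategySketch {
  boolean shouldReassign(SendFailureKind kind, int consecutiveFailures);
}

/** Reassigns on any failure, which corresponds to the regular reassignment scope. */
final class AlwaysReassign implements ReassignStrategySketch {
  @Override
  public boolean shouldReassign(SendFailureKind kind, int consecutiveFailures) {
    return true;
  }
}

/** Tolerates a few high-load rejections before moving the partition elsewhere. */
final class TolerateHighLoad implements ReassignStrategySketch {
  private final int highLoadThreshold;

  TolerateHighLoad(int highLoadThreshold) {
    this.highLoadThreshold = highLoadThreshold;
  }

  @Override
  public boolean shouldReassign(SendFailureKind kind, int consecutiveFailures) {
    if (kind == SendFailureKind.SERVER_FAULTY) {
      return true; // a faulty server is never retried in this sketch
    }
    return consecutiveFailures >= highLoadThreshold;
  }
}
```

Keeping the decision behind one interface would let the writer stay unchanged if a high-load error code is added on the shuffle-server side later.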

jerqi · 2024-04-30T02:01:04Z

I want the strategy class to help us understand the TaskAttemptAssignment class.

zuston · 2024-04-30T02:33:14Z

Could you review this class directly and leave comments and suggestions?

common/src/main/java/org/apache/uniffle/common/ReceivingFailureServer.java

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/TaskAttemptAssignment.java

jerqi

LGTM, Do you need to update your description of this pull request?

zuston · 2024-04-30T06:20:37Z

LGTM, Do you need to update your description of this pull request?

Updated.

Please take a look @EnricoMi. If you have no objections, I will merge this.

qqqttt123 · 2024-04-30T06:36:15Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/ShuffleHandleInfoBase.java

+
+import org.apache.uniffle.common.RemoteStorageInfo;
+
+public abstract class ShuffleHandleInfoBase implements ShuffleHandleInfo, Serializable {


ShuffleHandleInfoBase -> BaseShuffleHandleInfo?

I would like the classes with the ShuffleHandleInfo prefix to be grouped near each other.

Base class is a more common name.

I like BaseShuffleHandleInfo, but a class name like {Interface}Base is common practice for a base implementation of an interface {Interface}.

Let's keep this name.

client-spark/common/src/main/java/org/apache/spark/shuffle/RssSparkConfig.java

client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java

EnricoMi · 2024-04-30T07:37:29Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/MutableShuffleHandleInfo.java

+import org.apache.uniffle.proto.RssProtos;
+
+/** This class holds the dynamic partition assignment for partition reassign mechanism. */
+public class MutableShuffleHandleInfo extends ShuffleHandleInfoBase {


A class should not be named after its usage, but after what it implements, as it has no knowledge about how it is being used, and it does not care, it provides only what it implements.

This handle info is used for fault tolerance, but there is no fault tolerance built into this class. It uses per-replica server infos to implement getAvailablePartitionServersForWriter() and getAllPartitionServersForReader().
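
As a rough illustration of the per-replica bookkeeping described in this comment, here is a sketch under stated assumptions. ReplicaTrackingSketch and its method names are hypothetical, servers are plain strings, and the rule that writers use the newest server of each replica while readers see every server ever assigned is an assumption made for the example, not something confirmed by the diff shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: partitionId -> replicaIndex -> servers, with the newest replacement last. */
final class ReplicaTrackingSketch {
  private final Map<Integer, TreeMap<Integer, List<String>>> servers = new HashMap<>();

  /** Appends a server to the history of one replica of one partition. */
  void assign(int partitionId, int replicaIndex, String serverId) {
    servers
        .computeIfAbsent(partitionId, k -> new TreeMap<>())
        .computeIfAbsent(replicaIndex, k -> new ArrayList<>())
        .add(serverId);
  }

  /** Assumption: writers only need the newest server of every replica. */
  Map<Integer, List<String>> availableServersForWriter() {
    Map<Integer, List<String>> result = new HashMap<>();
    servers.forEach(
        (partitionId, replicas) -> {
          List<String> latest = new ArrayList<>();
          // TreeMap iteration keeps replicas in index order.
          for (List<String> history : replicas.values()) {
            latest.add(history.get(history.size() - 1));
          }
          result.put(partitionId, latest);
        });
    return result;
  }

  /** Assumption: readers must visit every server that may hold blocks, replaced ones too. */
  Map<Integer, Set<String>> allServersForReader() {
    Map<Integer, Set<String>> result = new HashMap<>();
    servers.forEach(
        (partitionId, replicas) -> {
          Set<String> all = new LinkedHashSet<>();
          replicas.values().forEach(all::addAll);
          result.put(partitionId, all);
        });
    return result;
  }
}
```

The class itself only stores and projects server lists, which is the point of the naming comment: how the data is used for fault tolerance is decided by the callers.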

EnricoMi · 2024-04-30T07:40:56Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/ShuffleHandleInfoBase.java

+
+import org.apache.uniffle.common.RemoteStorageInfo;
+
+public abstract class ShuffleHandleInfoBase implements ShuffleHandleInfo, Serializable {


I like BaseShuffleHandleInfo, but a class name like {Interface}Base is common practice for a base implementation of an interface {Interface}.

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/TaskAttemptAssignment.java

EnricoMi · 2024-04-30T07:47:15Z

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/TaskAttemptAssignment.java

+  private boolean mutable = false;
+
+  public TaskAttemptAssignment(long taskAttemptId, ShuffleHandleInfo shuffleHandleInfo) {
+    this.assignment = shuffleHandleInfo.getAvailablePartitionServersForWriter();


After removing mutable completely (see below), this should be changed to:

Suggested change

-    this.assignment = shuffleHandleInfo.getAvailablePartitionServersForWriter();
+    this.update(shuffleHandleInfo);

zuston · 2024-05-06T09:26:00Z

All done. PTAL again @jerqi @EnricoMi

zuston · 2024-05-07T06:46:33Z

Gentle ping @EnricoMi

zuston · 2024-05-09T03:31:50Z

Merged. Thanks for your review @qqqttt123 @jerqi @dingshun3016 @xumanbu @EnricoMi

Feel free to discuss more if you have any suggestions @EnricoMi.

[apache#1608][part-5] feat(spark3): always use the latest assignment …

450a9ab

…and load balance for huge partition

zuston requested review from EnricoMi and jerqi April 17, 2024 02:31

checkstyle fix

697f946

EnricoMi reviewed Apr 17, 2024

client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java

EnricoMi reviewed Apr 17, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/ShuffleHandleInfoWrapper.java

Update client-spark/common/src/main/java/org/apache/spark/shuffle/wri…

c56faab

…ter/ShuffleHandleInfoWrapper.java Co-authored-by: Enrico Minack <[email protected]>

EnricoMi reviewed Apr 17, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/ShuffleHandleInfoWrapper.java

EnricoMi reviewed Apr 17, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/ShuffleHandleInfo.java

dingshun3016 reviewed Apr 19, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/WriteBufferManager.java

Remove the naming of plan and TaskAttemptAssignment

a2bd10d

dingshun3016 reviewed Apr 22, 2024

...spark/common/src/main/java/org/apache/uniffle/shuffle/manager/ShuffleManagerGrpcService.java

client-spark/spark3/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java

zuston added 3 commits April 22, 2024 16:43

use concurrenthashmap + hashset to ensure thread safe

1dc73b5

register to shuffle-servers for those partitions that their faulty se…

6b4afac

…rvers have been replacemented

introduce the max reassign server num for partition to check

d1c333a

xumanbu reviewed Apr 22, 2024

client-spark/spark3/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java

dingshun3016 reviewed Apr 22, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/ShuffleHandleInfo.java

zuston added 4 commits April 23, 2024 10:11

fix current modification exceptions

277ef42

only register those newly added partition servers

41bcab8

add the test for the updateAssignment

2bbda30

fix tests

52e0b3e

zuston requested review from EnricoMi, dingshun3016 and xumanbu April 23, 2024 05:34

dingshun3016 reviewed Apr 23, 2024

internal-client/src/main/java/org/apache/uniffle/client/impl/grpc/ShuffleServerGrpcClient.java

jerqi reviewed Apr 23, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/RssSparkConfig.java

jerqi reviewed Apr 23, 2024

server/src/main/java/org/apache/uniffle/server/ShuffleServer.java

throw error in netty client

ea974c5

jerqi reviewed Apr 30, 2024

common/src/main/java/org/apache/uniffle/common/ReceivingFailureServer.java

jerqi reviewed Apr 30, 2024

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/TaskAttemptAssignment.java

add comments for taskattemptAssignment

e127cf3

jerqi previously approved these changes Apr 30, 2024

qqqttt123 reviewed Apr 30, 2024

client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java

EnricoMi reviewed Apr 30, 2024

zuston added 2 commits May 6, 2024 10:32

remove mutable var followed by emricoMi

9c60ef4

rename followed by jerqi

5f9e061

zuston dismissed jerqi’s stale review via 5f9e061 May 6, 2024 02:38

partitionReassign -> reassign

f44f6a4

zuston requested review from EnricoMi, dingshun3016, jerqi and qqqttt123 May 7, 2024 02:01

jerqi approved these changes May 9, 2024

zuston merged commit 30bf8dc into apache:master May 9, 2024

zuston mentioned this pull request May 11, 2024

Support client partition data reassign #1608

Open

9 tasks

