@pkuzyc pkuzyc commented Jul 29, 2024

PR Category

Auto Parallel

PR Types

New features

Description

Pcard-67164

Support simple MoE training in PIR:

  1. Replace the MoE-related dist ops, i.e. moe_sub_mesh_tensors and moe_global_mesh_tensor (added in "[Dist Dialect] Add MoE-related api in PIR dist dialect", #66462), with the share_data op in remove_other_rank_op_pass.
    [Image 1]

  2. Add the sub_to_global_mesh reshard function in PIR; it currently supports only replicated Tensors. It is needed in global_norm_clip, where the l2 norm of each expert's parameter must be resharded to the other ranks, e.g. the l2 norm of a parameter on ProcessMesh([0]) is resharded to ProcessMesh([0,1]).
    NOTE: this is a preliminary version and needs further refinement to make it consistent with the dygraph reshard function.

  3. Refine the MoE-related APIs for multi-dimensional process meshes, e.g. mesh=[[0,1],[2,3]], with one expert on ranks [0,2] and the other on ranks [1,3].
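
For readers unfamiliar with the setup in items 2 and 3, here is a minimal NumPy sketch of the idea. It is an illustration only, not the code in this PR and not Paddle API: `split_mesh` and `global_norm_clip` are hypothetical helper names, the mesh is modelled as a plain array of ranks, and the sub-mesh to global-mesh reshard of a replicated value is modelled by simply collecting each expert's scalar in one place.

```python
import numpy as np


def split_mesh(mesh, dim):
    """Hypothetical helper: split a process mesh along `dim`, returning
    the list of ranks in each resulting sub-mesh."""
    mesh = np.asarray(mesh)
    return [
        np.take(mesh, i, axis=dim).reshape(-1).tolist()
        for i in range(mesh.shape[dim])
    ]


def global_norm_clip(grads_per_expert, clip_norm):
    """Hypothetical helper: clip gradients by their global l2 norm when
    each gradient lives on a different sub-mesh."""
    # Each expert computes the squared l2 norm of its own gradient locally.
    sq_norms = [float(np.sum(g ** 2)) for g in grads_per_expert]
    # Assumption: summing the list stands in for resharding each replicated
    # scalar from its sub-mesh to the global mesh and adding them there.
    global_norm = float(np.sqrt(sum(sq_norms)))
    # Scale factor is 1 when the global norm is already within clip_norm.
    scale = clip_norm / max(global_norm, clip_norm)
    return global_norm, [g * scale for g in grads_per_expert]


# One expert per column of the 2-D mesh from item 3.
sub_meshes = split_mesh([[0, 1], [2, 3]], dim=1)
print(sub_meshes)  # [[0, 2], [1, 3]]

# Two experts whose gradients live on different sub-meshes.
norm, clipped = global_norm_clip(
    [np.array([3.0, 0.0]), np.array([0.0, 4.0])], clip_norm=1.0
)
print(norm)  # 5.0
```

The point of the sketch is that the global norm depends on every expert's local norm, so a value computed on one sub-mesh has to become visible on the whole mesh before the scale factor can be applied.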

Additional: add a unit test for reshape_grad when its output x_grad needs to be resharded, corresponding to #67729

paddle-bot bot commented Jul 29, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@pkuzyc pkuzyc force-pushed the pir_moe_pass branch 2 times, most recently from 089384d to ff60be1 Compare August 5, 2024 13:44
@pkuzyc pkuzyc force-pushed the pir_moe_pass branch 2 times, most recently from 45ae555 to d1496fa Compare August 12, 2024 03:57

paddle-ci-bot bot commented Aug 20, 2024

Sorry to inform you that d1496fa's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@pkuzyc pkuzyc marked this pull request as draft August 22, 2024 02:14
@pkuzyc pkuzyc force-pushed the pir_moe_pass branch 3 times, most recently from 24bdd87 to bf6ddb8 Compare September 2, 2024 14:04
@pkuzyc pkuzyc marked this pull request as ready for review September 2, 2024 14:06
@pkuzyc pkuzyc force-pushed the pir_moe_pass branch 2 times, most recently from caa33b6 to 9a7205f Compare September 6, 2024 06:57
JZ-LIANG previously approved these changes Sep 9, 2024

@JZ-LIANG JZ-LIANG left a comment

LGTM

@zhiqiu zhiqiu left a comment

LGTM

@JZ-LIANG JZ-LIANG merged commit d84c06f into PaddlePaddle:develop Sep 11, 2024
6 participants