[Dist Dialect] Simple MoE training in PIR #66750
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
Sorry to inform you that d1496fa's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
LGTM
Add the sub_to_global_reshard function; the loss of the PIR MoE demo is equal to that of dygraph auto parallel. Refine and rename the MoE APIs.
LGTM
PR Category: Auto Parallel
PR Types: New features
Description
Pcard-67164
Support simple MoE training in PIR:
- Replace the MoE-related dist ops, i.e. `moe_sub_mesh_tensors` and `moe_global_mesh_tensor` ([Dist Dialect] Add MoE-related api in PIR dist dialect #66462), with the `share_data` op in `remove_other_rank_op_pass`.
- Add a `sub_to_global_mesh` reshard function in PIR; it currently only supports replicated tensors. This is needed in `global_norm_clip`, where the L2 norm of each expert's parameters must be resharded to the other ranks, e.g. resharding the L2 norm of a parameter on ProcessMesh([0]) to ProcessMesh([0,1]); a minimal sketch follows this list.
  NOTE: this is a preliminary version and needs further refinement to make it consistent with the dygraph reshard function.
- Refine the MoE-related APIs for multi-dimensional process meshes, e.g. mesh=[[0,1],[2,3]], with one expert on ranks [0,2] and the other on ranks [1,3]; see the second sketch after this list.
- Additionally, add a unit test for `reshape_grad` when its output `x_grad` needs reshard, corresponding to #67729.
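To make the sub-mesh-to-global-mesh reshard concrete, below is a minimal sketch using the public `paddle.distributed` dygraph auto-parallel API (`ProcessMesh`, `shard_tensor`, `reshard`, `Replicate`), which this PR aims to be consistent with; the two-rank launch setup, the tensor value, and the variable names are illustrative assumptions, not part of this PR's internal pass wiring.

```python
# Minimal sketch: reshard a replicated tensor from an expert's sub-mesh to the
# global mesh, as needed by global_norm_clip. Assumes 2 ranks, launched with
# `python -m paddle.distributed.launch`. Values and names are hypothetical.
import paddle
import paddle.distributed as dist

sub_mesh = dist.ProcessMesh([0])        # the expert's sub-mesh
global_mesh = dist.ProcessMesh([0, 1])  # the global mesh

# per-expert L2 norm, replicated on the sub-mesh (hypothetical value)
local_norm = dist.shard_tensor(
    paddle.to_tensor([3.0]), sub_mesh, [dist.Replicate()]
)

# after the reshard, every rank of the global mesh holds the value
global_norm = dist.reshard(local_norm, global_mesh, [dist.Replicate()])
```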
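The multi-dimensional mesh layout can be sketched similarly; the `dp`/`ep` dimension names, the per-column expert placement, and the weight shapes below are assumptions for illustration only.

```python
# Minimal sketch: a 2-D process mesh [[0, 1], [2, 3]] with experts split along
# the second mesh dimension, so expert 0 lives on ranks [0, 2] and expert 1 on
# ranks [1, 3]. Dimension names and shapes are hypothetical.
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "ep"])

# one sub-mesh per expert: a single column of the global mesh
expert_meshes = [
    dist.ProcessMesh([[0], [2]], dim_names=["dp", "ep"]),  # expert 0
    dist.ProcessMesh([[1], [3]], dim_names=["dp", "ep"]),  # expert 1
]

# each expert's weight is replicated over its own 2-D sub-mesh
weights = [
    dist.shard_tensor(paddle.randn([16, 16]), m, [dist.Replicate(), dist.Replicate()])
    for m in expert_meshes
]
```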