
Conversation


@liym27 liym27 commented Dec 20, 2024

PR Category

Auto Parallel

PR Types

Bug fixes

Description

Fix a communication hang issue on GPU-H.

Depends on PR #70615.

TODO: fix vpp hang

PCard-86802

@paddle-bot
paddle-bot bot commented Dec 20, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@liym27 liym27 changed the title from "fix" to "Auto-Parallel | Comm】fix communication hang issue on GPU-H" Dec 24, 2024
```python
        '1',
    ]:
        op.set_execution_stream(
            ExecutionStreamType.DefaultStream.value
        )
```

Why can't the recv communication run on recv_stream?

@liym27 liym27 Dec 25, 2024

Why can't the recv communication run on recv_stream?

Running recv on recv_stream hangs. On the compute stream it executes correctly without hanging, and since subsequent computation has to wait for the recv to finish anyway, there should be no performance impact.
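The scheduling change described above can be sketched in plain Python, with no Paddle dependency: p2p recv ops are pinned to the default (compute) stream instead of a dedicated recv stream. The op-name prefix `pd_op.recv` and the stream labels here are illustrative assumptions, not Paddle's actual API.

```python
# Self-contained stand-in for the stream-assignment logic in this PR:
# pin recv ops to the default (compute) stream to avoid the GPU-H hang.
DEFAULT_STREAM = "default"
RECV_STREAM = "recv"

def assign_streams(op_names, pin_recv_to_default=True):
    """Map each op name to the stream it should run on."""
    streams = {}
    for name in op_names:
        if name.startswith("pd_op.recv") and not pin_recv_to_default:
            streams[name] = RECV_STREAM
        else:
            # recv ops join the compute stream: running them on a dedicated
            # recv stream hangs on GPU-H, and downstream compute has to wait
            # for the recv to finish anyway, so this costs no performance.
            streams[name] = DEFAULT_STREAM
    return streams
```

With `pin_recv_to_default=False` the sketch reproduces the old behavior (recv on its own stream), which is the configuration that hangs on GPU-H.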

```python
recorder_op.set_str_attr("event_to_record", name)
waiter_op.set_str_array_attr("events_to_wait", [name])
```
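The attributes above pair a recorder op with a waiter op through a shared event name. A minimal plain-Python model of that pairing (class and method names are illustrative, not the executor's real interface):

```python
class EventSync:
    """Toy model of cross-stream event synchronization: a recorder op marks
    an event as recorded; a waiter op is ready only once every event in its
    events_to_wait list has been recorded."""

    def __init__(self):
        self.recorded = set()

    def record(self, event_to_record):
        # corresponds to the "event_to_record" attribute above
        self.recorded.add(event_to_record)

    def ready(self, events_to_wait):
        # corresponds to the "events_to_wait" attribute above
        return all(e in self.recorded for e in events_to_wait)
```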

```python
def _split_program_into_forward_backward_optimize_recv_send(
```

This function is very long (300+ lines of code) with many if-else branches, but the core flow is the same throughout: in an ordered sequence of programs (fwd -> recv_fwd -> fwd -> bwd -> send_bwd -> opt), a designated op is kept in one of the programs, deleted from the others, and new data-dependency relationships are built for the successor programs.
Could this operation be abstracted into a reusable helper function, to reduce the cost of understanding and maintaining the code later?


This function is very long (300+ lines of code) with many if-else branches, but the core flow is the same throughout: in an ordered sequence of programs (fwd -> recv_fwd -> fwd -> bwd -> send_bwd -> opt), a designated op is kept in one of the programs, deleted from the others, and new data-dependency relationships are built for the successor programs. Could this operation be abstracted into a reusable helper function, to reduce the cost of understanding and maintaining the code later?

Thx, I've abstracted it out here to improve readability and reduce maintenance cost.
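The refactor agreed on above can be sketched as a single reusable helper. This is a hypothetical simplification: the real code operates on PIR programs, not plain lists, but the core move is the same: keep an op in one designated sub-program, drop it from the rest, and report the dependency edges the successor programs now need.

```python
def keep_op_in_one_program(op_id, programs, keep_in):
    """programs: ordered mapping job_type -> list of op ids.

    Removes op_id from every program except `keep_in`, and returns
    (keep_in, job_type) edges for every program that lost the op and
    therefore now depends on the program that kept it."""
    new_deps = []
    for job_type, ops in programs.items():
        if job_type == keep_in or op_id not in ops:
            continue
        ops.remove(op_id)
        new_deps.append((keep_in, job_type))
    return new_deps
```

Centralizing this loop means each of the many if-else branches in the original 300+ line function reduces to one call with a different `keep_in` argument.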


```python
# before
def set_pir_skip_gc_vars(num_micro_batches, job_types, sub_programs, jobs):

# after
def set_pir_skip_gc_vars(
    num_micro_batches, job_types, sub_programs, jobs, type_to_skip_gc_vars={}
):
```

Why does type_to_skip_gc_vars need to be passed in from outside? Is there dependency information that is not captured in sub_programs?

@liym27 liym27 Dec 25, 2024

Why does type_to_skip_gc_vars need to be passed in from outside? Is there dependency information that is not captured in sub_programs?

This mainly handles the following scenario: a variable sent in recv_fwd is used by both fwd and bwd, so it must not be garbage-collected during the fwd stage. sub_programs can express this, but it would require re-analyzing the dependencies; that analysis was already done (with fairly complex code) when splitting the program for 1F1B, so to avoid duplicating it, the information is passed in from outside.
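The scenario in this reply can be illustrated with a small self-contained sketch (all names hypothetical): skip-gc sets are derived from what later jobs still use, and an externally supplied map covers the dependencies, like the recv_fwd variable consumed by both fwd and bwd, that were already analyzed during the 1F1B split.

```python
def compute_skip_gc_vars(job_order, produced, used, extra=None):
    """For each job, the vars it produces that later jobs still use, merged
    with externally precomputed entries (type_to_skip_gc_vars in this PR)."""
    extra = extra or {}
    skip = {}
    for i, job in enumerate(job_order):
        later_used = set()
        for later in job_order[i + 1:]:
            later_used |= used.get(later, set())
        # producer-based analysis, plus entries the caller already knows about
        skip[job] = (produced.get(job, set()) & later_used) | set(extra.get(job, ()))
    return skip
```

Without the `extra` map, the skip set of fwd for a variable it merely consumes (but does not produce) would be empty, and the variable could be collected before bwd runs; the external map closes that gap without re-running the dependency analysis.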

fixed
#70615

```python
# it will be used by other sub_programs
type_to_var_names[job_type].add(op.attrs()["output_name"])
if job_type in ["backward", "backward_w"]:
    if os.getenv("FLAGS_enable_p2p_comm_opt", 0) in [
```

Could this switch simply be removed or enabled by default? If it is kept, I'd suggest renaming it: from the name FLAGS_enable_p2p_comm_opt it's hard to tell what communication optimization is actually performed underneath (in fact it is not a communication optimization; it only changes the scheduling and does not improve performance).

@liym27 liym27 Dec 25, 2024

Could this switch simply be removed or enabled by default? If it is kept, I'd suggest renaming it: from the name FLAGS_enable_p2p_comm_opt it's hard to tell what communication optimization is actually performed underneath (in fact it is not a communication optimization; it only changes the scheduling and does not improve performance).

Thx, agreed. Confirmed offline: it will be enabled by default.
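A defensive way to read such a flag with the agreed default-on policy can be sketched as below. The helper name is hypothetical; note also that the diff earlier in the thread passes `0` (an int) as the `os.getenv` default and then compares against a list of strings, so when the variable is unset the check falls through to False.

```python
import os

def p2p_comm_opt_enabled(default=True):
    """Read FLAGS_enable_p2p_comm_opt with an explicit default-on policy."""
    val = os.getenv("FLAGS_enable_p2p_comm_opt")
    if val is None:
        return default  # flag unset: follow the agreed default (on)
    # os.getenv always returns a string when the variable is set
    return val.strip().lower() in ("1", "true", "yes", "on")
```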

@paddle-ci-bot

paddle-ci-bot bot commented Dec 28, 2024

Sorry to inform you that 4bdf88b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@liym27 liym27 changed the title from "Auto-Parallel | Comm】fix communication hang issue on GPU-H" to "【Auto-Parallel | Comm】fix communication hang issue on GPU-H" Dec 28, 2024
@liym27 liym27 force-pushed the fix_cross_comm_1215 branch 2 times, most recently from 48077ca to f67d7b6 on January 6, 2025 04:24
@liym27 liym27 force-pushed the fix_cross_comm_1215 branch from f67d7b6 to a4c1a21 on January 6, 2025 05:36
From00 previously approved these changes Jan 6, 2025
@From00 From00 left a comment

LGTM

fix code
@liym27 liym27 force-pushed the fix_cross_comm_1215 branch from c64fd32 to 0412f68 on January 6, 2025 13:15
@zhiqiu zhiqiu left a comment

LGTM

@liym27 liym27 merged commit ce844b8 into PaddlePaddle:develop Jan 7, 2025
30 of 31 checks passed
sneaxiy pushed a commit that referenced this pull request Jan 9, 2025
… scenarios and fix communication hang issue on GPU-H (#70687)

* 【Auto-Parallel】Refactor set_skip_gc_vars to adapt to all scenarios (#70615)

* 【Auto-Parallel | Comm】fix communication hang issue on GPU-H (#70360)
