Vectorize ScaleCUDAKernel in DistributedFusedLamb #40011

sneaxiy · 2022-03-01T03:14:12Z

PR types

Performance optimization

PR changes

OPs

Describe

Vectorize the ScaleCUDAKernel in DistributedFusedLamb, using VecSize = 8 for FP16.
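For readers unfamiliar with the technique, the following is a minimal CPU-side C++ sketch of the vectorized access pattern, not the code from this PR. With FP16, VecSize = 8 packs 8 × 2 bytes = 16 bytes, which matches a 128-bit load/store per GPU thread. The names `AlignedVector`, `ScaleVectorized`, and `ScaleCopy` are hypothetical, and the sketch uses `float` with VecSize = 4 (also 16 bytes) because standard C++ has no portable FP16 type. On a GPU, the outer loop would be spread across threads instead of running serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: VecSize elements packed so they can be moved as one
// wide load/store. sizeof(T) * VecSize must be a power of two.
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) AlignedVector {
  T val[VecSize];
};

// Computes y[i] = x[i] * scale for i in [0, n).
// Aligned full chunks are processed VecSize elements at a time; the
// remaining tail (or everything, if a pointer is misaligned) is scalar.
template <typename T, int VecSize>
void ScaleVectorized(const T* x, T scale, T* y, std::size_t n) {
  assert(n == 0 || (x != nullptr && y != nullptr));
  using Vec = AlignedVector<T, VecSize>;

  // Wide accesses are only valid when both pointers are suitably aligned.
  const bool aligned =
      reinterpret_cast<std::uintptr_t>(x) % alignof(Vec) == 0 &&
      reinterpret_cast<std::uintptr_t>(y) % alignof(Vec) == 0;

  std::size_t i = 0;
  if (aligned) {
    const std::size_t vec_end = n - n % VecSize;
    for (; i < vec_end; i += VecSize) {
      Vec v;
      std::memcpy(&v, x + i, sizeof(Vec));  // one wide load
      for (int j = 0; j < VecSize; ++j) {
        v.val[j] *= scale;
      }
      std::memcpy(y + i, &v, sizeof(Vec));  // one wide store
    }
  }

  // Scalar path for the tail and for misaligned inputs.
  for (; i < n; ++i) {
    y[i] = x[i] * scale;
  }
}

// Usage sketch: scale a buffer of floats, 4 elements (16 bytes) per vector.
inline std::vector<float> ScaleCopy(const std::vector<float>& x, float scale) {
  std::vector<float> y(x.size());
  ScaleVectorized<float, 4>(x.data(), scale, y.data(), x.size());
  return y;
}
```

The alignment check and scalar tail matter because a buffer's length is not always a multiple of VecSize and its base address is not always 16-byte aligned. Falling back to the scalar path keeps the result identical in both cases.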

Time per batch:

Time (ms)	Measurement	Improvement (time reduction)
Before optimization (average of 657 runs)	1268765.3	-
After optimization (average of 657 runs)	750867.1	+40.8%

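Note: this kernel is not used when running on a single machine with 8 GPUs; it is mainly used when running on a single machine with a single GPU.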

paddle-bot-old · 2022-03-01T03:14:15Z

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

limin2021

LGTM.

vec scale kernel

1907197

sneaxiy requested a review from limin2021 March 1, 2022 06:36

limin2021 approved these changes Mar 2, 2022

sneaxiy merged commit 2e6548a into PaddlePaddle:develop Mar 2, 2022

sneaxiy deleted the vec_scale_kernel branch March 2, 2022 10:58
