Conversation

@s5u13b (Contributor) commented Sep 9, 2022

PR types

Performance optimization

PR changes

OPs

Describe

  • Environment:
    • V100-32G, CUDA 11.2, cuDNN 8
  • Feature:
    • Replace the div and mod operations with a fast_divmod operation.
    • Replace the 1-D GPU launch with a 3-D GPU launch.
    • Optimize the input-grad accumulation logic. Before this change, the GPU launch config was based on the input data, and each input gradient was accumulated by traversing the corresponding indices of the output mask data, which incurred substantial output-index computation overhead. After this change, the launch config is based on the output data, and each output gradient is accumulated directly into the input position given by its saved max index; this removes the output-index computation at the cost of an atomic add.
    • (Config 0 is not optimized yet because Paddle calls the cuDNN kernel for config 0 of the max_pool3d benchmark.)
  • Performance (OP Benchmark):
| Paddle Kernel | Config ID | Perf Before | Perf After | Improvement | Perf of PyTorch |
| --- | --- | --- | --- | --- | --- |
| cudnn::pooling_bw_5d_kernel_max | 0 | 1779.7us | - | - | 725.72us |
| KernelMaxPool3DWithIdxGrad | 1 | 6128.1us | 677.62us | 804.3% | 725.83us |

@paddle-bot-old paddle-bot-old bot added the contributor External developers label Sep 9, 2022
@JamesLim-sy (Contributor) left a comment

LGTM. Please also add performance data for the other ops that go through the same kernel.

@JamesLim-sy JamesLim-sy merged commit 0e563da into PaddlePaddle:develop Sep 20, 2022