Description
Thanks to @tridao for reviewing PR722 and mentioning a new solution.
In that approach, @tridao proposed computing separate local dQ partitions based on the number of SMs and reduce-summing them in a later kernel (see the sketch below). It is a concise solution, but the number of blocks is small, so it cannot fully utilize the GPU resources, which results in poor performance.
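A minimal sketch of that two-stage pattern, assuming one dQ copy per partition; kernel and buffer names here are illustrative placeholders, not the actual FlashAttention kernels, and the per-partition compute is stubbed out:

```cuda
#include <cuda_runtime.h>

// Stage 1 (hypothetical): each partition (blockIdx.y) accumulates its local
// dQ contribution into its own slice of dq_partial. The real computation is
// the attention backward pass; a placeholder product is used here.
__global__ void dq_partial_kernel(const float* dout, const float* k,
                                  float* dq_partial, int dq_elems) {
    int partition = blockIdx.y;                         // one dQ copy per partition
    float* my_dq = dq_partial + (size_t)partition * dq_elems;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < dq_elems;
         i += gridDim.x * blockDim.x) {
        my_dq[i] = dout[i] * k[i];                      // placeholder for local dQ
    }
}

// Stage 2 (hypothetical): reduce-sum the per-partition dQ buffers into dQ.
__global__ void dq_reduce_kernel(const float* dq_partial, float* dq,
                                 int num_partitions, int dq_elems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= dq_elems) return;
    float acc = 0.f;
    for (int p = 0; p < num_partitions; ++p)
        acc += dq_partial[(size_t)p * dq_elems + i];
    dq[i] = acc;
}
```

With this layout the number of stage-1 blocks is tied to the partition count (e.g. the SM count), which is what limits occupancy in the case described above.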
I conducted a performance comparison between PR722 and that approach. Based on the end-to-end timing of the attention module, PR722 proved to be more efficient.
I understand your design philosophy and agree that it is a good solution, but in our real-world case we found that PR722 is more efficient. We may need to test more cases to reach a final conclusion. Thus I suggest merging PR722 into the latest version.