Comparison of different solutions for deterministic backward #747

@defei-coder

Thanks to @tridao for reviewing PR722 and suggesting a new solution.
In that approach, @tridao proposed computing separate local dQ buffers, partitioned by the number of SMs, and reduce-summing them in a later kernel. It is a concise solution, but it launches only a small number of blocks, so it cannot fully utilize the GPU, which results in poor performance.
I compared the performance of PR722 and that approach. Based on end-to-end timing of the attention module, PR722 proved to be more efficient.
I understand your design philosophy and agree that it is a good solution, but in real cases we found PR722 to be more efficient. We may need to test more cases to reach a final conclusion. I therefore suggest merging PR722 into the latest version.
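
For readers unfamiliar with the partition-then-reduce idea discussed above, here is a minimal CPU sketch in plain Python. It is not code from either PR. The function names (`partial_dq`, `reduce_dq`, `deterministic_dq`) and the round-robin block assignment are assumptions made for illustration, and `num_partitions` only stands in for the number of SMs. The point it shows is that when each partition accumulates into its own buffer and the buffers are summed in a fixed order, the floating-point result is bitwise reproducible.

```python
import random


def partial_dq(contribs, num_partitions):
    """Accumulate per-block dQ contributions into one local buffer per partition.

    contribs: list of equal-length float lists, one per key/value block.
    num_partitions: stand-in for the number of SMs (an assumption of this sketch).
    """
    width = len(contribs[0])
    local = [[0.0] * width for _ in range(num_partitions)]
    for block_idx, contrib in enumerate(contribs):
        # Static round-robin assignment (hypothetical): a block always lands
        # in the same partition, so each buffer sees a fixed addition order.
        p = block_idx % num_partitions
        for i, value in enumerate(contrib):
            local[p][i] += value
    return local


def reduce_dq(local):
    """Second pass: sum the local buffers in a fixed partition order."""
    width = len(local[0])
    dq = [0.0] * width
    for part in local:
        for i, value in enumerate(part):
            dq[i] += value
    return dq


def deterministic_dq(contribs, num_partitions):
    """Partition, accumulate locally, then reduce."""
    return reduce_dq(partial_dq(contribs, num_partitions))


if __name__ == "__main__":
    random.seed(0)
    contribs = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(32)]
    first = deterministic_dq(contribs, num_partitions=4)
    second = deterministic_dq(contribs, num_partitions=4)
    print(first == second)
```

The sketch says nothing about speed. On a GPU, the trade-off raised in this issue is that only as many blocks as partitions are doing work, which is separate from the reproducibility property illustrated here.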
