Description
Thanks to @tridao for reviewing PR722 and mentioning a new solution.
In that approach, @tridao proposed computing separate local dQ partitions based on the number of SMs and reduce-summing them in a later kernel (see the sketch below). It is a concise solution, but the number of blocks is small, so it cannot fully utilize the GPU resources, which results in poor performance.
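A minimal sketch of that two-stage pattern, assuming one dQ copy per partition; kernel and buffer names here are illustrative placeholders, not the actual FlashAttention kernels, and the per-partition compute is stubbed out:

```cuda
#include <cuda_runtime.h>

// Stage 1 (hypothetical): each partition (blockIdx.y) accumulates its local
// dQ contribution into its own slice of dq_partial. The real computation is
// the attention backward pass; a placeholder product is used here.
__global__ void dq_partial_kernel(const float* dout, const float* k,
                                  float* dq_partial, int dq_elems) {
    int partition = blockIdx.y;                         // one dQ copy per partition
    float* my_dq = dq_partial + (size_t)partition * dq_elems;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < dq_elems;
         i += gridDim.x * blockDim.x) {
        my_dq[i] = dout[i] * k[i];                      // placeholder for local dQ
    }
}

// Stage 2 (hypothetical): reduce-sum the per-partition dQ buffers into dQ.
__global__ void dq_reduce_kernel(const float* dq_partial, float* dq,
                                 int num_partitions, int dq_elems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= dq_elems) return;
    float acc = 0.f;
    for (int p = 0; p < num_partitions; ++p)
        acc += dq_partial[(size_t)p * dq_elems + i];
    dq[i] = acc;
}
```

With this layout the number of stage-1 blocks is tied to the partition count (e.g. the SM count), which is what limits occupancy in the case described above.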
I conducted a performance comparison between PR722 and that approach. Based on the end-to-end timing of the attention module, PR722 proved to be more efficient.
I understand your design philosophy and agree that it is a good solution, but in our real-world case we found that PR722 is more efficient. We may need to test more cases to reach a final conclusion. Thus I suggest merging PR722 into the latest version.