I wrote a custom attention kernel entirely in Triton; it contains many Triton kernels (>15).
During training, it takes nearly 200 steps (about 50 minutes) to reach its best performance. If I use the official flash-attn kernel instead, this phenomenon does not appear.
How can I solve this issue?
PS: I have already tuned the Triton config and do not use autotune during training.
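To be concrete, below is a simplified sketch of what I mean by a fixed config: block size, `num_warps`, and `num_stages` are hard-coded at launch time instead of being selected by `@triton.autotune`. The kernel here is a trivial placeholder (not my actual attention kernels), just to show the launch pattern.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _scale_kernel(x_ptr, out_ptr, n_elements, alpha,
                  BLOCK_SIZE: tl.constexpr):
    # Placeholder kernel: each program scales one block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * alpha, mask=mask)


def scale_tensor(x: torch.Tensor, alpha: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    # Fixed launch config instead of @triton.autotune:
    # BLOCK_SIZE / num_warps / num_stages come from offline tuning.
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    _scale_kernel[grid](x, out, n_elements, alpha,
                        BLOCK_SIZE=BLOCK_SIZE,
                        num_warps=4, num_stages=2)
    return out
```

All the real kernels are launched in the same way, with their meta-parameters hard-coded, so no autotuning should be running during training.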