A list of fused transformer kernels written in Triton.
Attention: I have only tested correctness and speed on the core kernel, not on the whole class, because there are some unsolved issues.
- Why is the error significantly larger in default mode than in INTERPRET mode? (issue)
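
One way to put a number on that gap is to compare the output of each mode against the same higher-precision reference and report the worst-case error. The sketch below is only an assumption about how such a check could look, not the test used here: `max_abs_err` and `max_rel_err` are hypothetical helper names, and the sample values are invented purely to exercise them. It uses plain Python lists so it runs without a GPU; with real tensors the same comparison would be made on the flattened outputs. Interpreter mode itself is enabled through the `TRITON_INTERPRET=1` environment variable.

```python
def max_abs_err(candidate, reference):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(candidate) != len(reference):
        raise ValueError("length mismatch")
    return max((abs(c - r) for c, r in zip(candidate, reference)), default=0.0)


def max_rel_err(candidate, reference, eps=1e-12):
    """Largest element-wise difference relative to the reference magnitude."""
    if len(candidate) != len(reference):
        raise ValueError("length mismatch")
    return max(
        (abs(c - r) / (abs(r) + eps) for c, r in zip(candidate, reference)),
        default=0.0,
    )


# Illustrative values only (not measurements from any kernel).
reference = [1.0, -2.0, 0.5, 4.0]
candidate = [1.0, -2.001, 0.5, 4.002]

print("max abs err:", max_abs_err(candidate, reference))
print("max rel err:", max_rel_err(candidate, reference))
```

Running the same check once per mode, with identical inputs and the same reference, makes it easier to tell whether the difference comes from reduced-precision arithmetic or from a logic error in the kernel.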
Performance: a 7% improvement over the torch kernel.
The difference between the black line and the red line is a change in the block size of the GPU kernel,
like this part in attention:
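
To illustrate what a block-size change does to a launch, here is a generic sketch; it is not the attention code referred to above. A Triton kernel is launched as a grid of program instances, each handling one block of the input, so the block size decides how many instances run and how much work each one does. `BLOCK_M`, `launch_shape`, and the sequence length are assumptions chosen for the example; `cdiv` does the same ceiling division as `triton.cdiv`.

```python
def cdiv(n, block):
    """Ceiling division: number of blocks of size `block` needed to cover `n` items."""
    return (n + block - 1) // block


def launch_shape(seq_len, block_m):
    """Return (number of program instances, rows handled by the last instance)."""
    if seq_len <= 0 or block_m <= 0:
        raise ValueError("seq_len and block_m must be positive")
    num_programs = cdiv(seq_len, block_m)
    # The last instance only covers what is left after the full blocks.
    last_rows = seq_len - (num_programs - 1) * block_m
    return num_programs, last_rows


# Hypothetical sequence length and candidate block sizes.
seq_len = 1000
for block_m in (32, 64, 128):
    num_programs, last_rows = launch_shape(seq_len, block_m)
    print(f"BLOCK_M={block_m}: {num_programs} programs, last one covers {last_rows} rows")
```

Larger blocks mean fewer program instances with more work and more on-chip memory per instance, which is why the best value has to be measured rather than guessed.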
- ffn2: working
- ffn2 + residual + norm
- linear + softmax
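
For checking the fused kernels in this list, an unfused reference is useful. The sketch below is one possible reading of the names, not the definition used here: it assumes `ffn2` is the second linear projection of the feed-forward block and that `norm` is a LayerNorm without learnable scale and shift. All function names are hypothetical, and it operates on a single row in plain Python so it runs without a GPU.

```python
import math


def linear(x, weight, bias):
    """y[j] = sum_i x[i] * weight[i][j] + bias[j] for a single input row."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def layer_norm(x, eps=1e-5):
    """Normalize one row to zero mean and unit variance (no affine parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def ffn2_residual_norm(hidden, weight, bias, residual):
    """Assumed meaning of 'ffn2 + residual + norm': project, add the skip input, normalize."""
    projected = linear(hidden, weight, bias)
    summed = [p + r for p, r in zip(projected, residual)]
    return layer_norm(summed)


def linear_softmax(x, weight, bias):
    """Assumed meaning of 'linear + softmax': project, then a numerically stable softmax."""
    logits = linear(x, weight, bias)
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A fused kernel does the same arithmetic in one pass without writing the intermediate results back to global memory, so its output should match this reference up to floating-point rounding.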