Triton-fused-kernel

A list of fused kernels for transformers, written in Triton.

Attention: I have only tested correctness and speed on the core kernels, not on the whole classes, because there are some unsolved issues.

Fast cross entropy loss

Performance: a 7% improvement over the torch kernel

The difference between the black line and the red line is a change in the block size of the GPU kernel

like this part in attention
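
To make the role of the block size concrete, here is a minimal pure-Python sketch of a block-wise cross entropy for a single row. It is not the Triton kernel from this repo: the function name `cross_entropy_row` and the `block_size` parameter are hypothetical, and it only illustrates the general idea of walking over the logits in blocks while keeping a running maximum and a rescaled running sum, which is also the pattern used for softmax in attention.

```python
import math


def cross_entropy_row(logits, target, block_size=4):
    """Loss for one row: logsumexp(logits) - logits[target], computed block by block.

    Hypothetical helper for illustration only, not this repo's kernel.
    """
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(x - running_max) over the elements seen so far
    for start in range(0, len(logits), block_size):
        block = logits[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new maximum before adding this block.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max
    return running_max + math.log(running_sum) - logits[target]


def cross_entropy_naive(logits, target):
    """Reference: -log(softmax(logits)[target]) without any blocking."""
    total = sum(math.exp(x) for x in logits)
    return -math.log(math.exp(logits[target]) / total)


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0, 0.5, -1.0]
    for bs in (1, 2, 8):
        print(bs, cross_entropy_row(row, target=2, block_size=bs))
    print("naive", cross_entropy_naive(row, target=2))
```

In this sketch the block size changes only how the work is split up, not the result. On a GPU the same choice decides how many elements each program instance loads at once, which is why it can affect speed.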
