
Do Triton kernels used for training need a warm-up? #7717

@mdy666

Description

Describe the issue

  • I wrote a custom attention kernel entirely in Triton; it contains many Triton kernels (>15).
  • During training, it takes nearly 200 steps (50 minutes) to reach its best performance. When I use the official flash-attn kernel instead, this phenomenon does not appear.
  • How can I solve this issue?
  • PS: I have already tuned the Triton configs by hand and do not use autotune during training.

Environment details

Triton: 3.2
