I wrote a custom attention kernel entirely in Triton; it contains many Triton kernels (>15).
During training, it takes nearly 200 steps (about 50 minutes) to reach its best performance. If I use the official flash-attn kernel instead, this phenomenon does not appear.
How can I solve this issue?
PS: I have already tuned the Triton config and do not use autotune during training.
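To be concrete, below is a simplified sketch of what I mean by a fixed config: block size, `num_warps`, and `num_stages` are hard-coded at launch time instead of being selected by `@triton.autotune`. The kernel here is a trivial placeholder (not my actual attention kernels), just to show the launch pattern.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _scale_kernel(x_ptr, out_ptr, n_elements, alpha,
                  BLOCK_SIZE: tl.constexpr):
    # Placeholder kernel: each program scales one block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * alpha, mask=mask)


def scale_tensor(x: torch.Tensor, alpha: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    # Fixed launch config instead of @triton.autotune:
    # BLOCK_SIZE / num_warps / num_stages come from offline tuning.
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    _scale_kernel[grid](x, out, n_elements, alpha,
                        BLOCK_SIZE=BLOCK_SIZE,
                        num_warps=4, num_stages=2)
    return out
```

All the real kernels are launched in the same way, with their meta-parameters hard-coded, so no autotuning should be running during training.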