
Conversation


@beiw-nv beiw-nv commented Aug 1, 2025

In multi-GPU runs, the CuTe implementation of _flash_attn_fwd returns incorrect values on any GPU other than device 0. This can be fixed with the torch.cuda.device context manager, as suggested in issue #1782.
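A minimal sketch of the proposed fix, assuming a wrapper around the forward entry point (the wrapper name and the `fwd_kernel` parameter are hypothetical; the actual call site in the PR differs):

```python
import torch

def flash_attn_fwd_on_device(q, k, v, fwd_kernel):
    # Hypothetical wrapper: `fwd_kernel` stands in for the CuTe
    # _flash_attn_fwd entry point.
    # Without the guard, the kernel can be launched on the current
    # CUDA device (typically 0) even though q/k/v live on another GPU.
    with torch.cuda.device(q.device):
        return fwd_kernel(q, k, v)
```

The guard only changes which device is current for the duration of the launch; it does not move any tensors.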


tridao commented Aug 1, 2025

I'm not sure it's a race condition; I suspect it just launches the kernel on CUDA device 0 even when the data is on CUDA device 1.


beiw-nv commented Aug 1, 2025

I see. When do you expect Blackwell support for _flash_attn_bwd?


tridao commented Aug 1, 2025

3-4 weeks
