
fix race condition bug in cute _flash_attn_fwd in multiple gpu env #1793


Open
wants to merge 2 commits into
base: main
from

Conversation


@beiw-nv beiw-nv commented Aug 1, 2025

In multi-GPU runs, the cute implementation of _flash_attn_fwd returns incorrect values when the inputs live on any GPU other than GPU 0. This can be fixed with the torch.cuda.device context manager, as suggested in issue #1782:
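
The kind of change described here can be sketched as follows. This is only an illustration under stated assumptions, not the actual diff of this PR: `device_guard` and `launch_on_tensor_device` are hypothetical helper names, and `kernel_fn` stands in for whatever launches the kernel inside _flash_attn_fwd. The only real API relied on is `torch.cuda.device`, which temporarily changes the current CUDA device.

```python
import contextlib

import torch


def device_guard(t: torch.Tensor):
    """Return a context manager that makes t's GPU the current CUDA device.

    Hypothetical helper for illustration; it is not part of flash-attn.
    For CPU tensors it returns a no-op context manager.
    """
    if t.is_cuda:
        return torch.cuda.device(t.device)
    return contextlib.nullcontext()


def launch_on_tensor_device(kernel_fn, q: torch.Tensor, *args, **kwargs):
    """Call kernel_fn while q's device is the current CUDA device.

    kernel_fn is a placeholder for the real kernel launch (assumption).
    The previous current device is restored when the block exits.
    """
    with device_guard(q):
        return kernel_fn(q, *args, **kwargs)
```

With this pattern, a caller holding tensors on cuda:1 would get the launch issued while device 1 is current, without having to call torch.cuda.set_device globally.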

@tridao
Member

tridao commented Aug 1, 2025

I'm not sure it's a "race condition". I suspect it just launches the kernel on CUDA device 0 even when the data is on CUDA device 1.
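
One way to check this hypothesis is to compare the device a tensor lives on with the process's current CUDA device. The sketch below is an assumption-laden illustration: `current_device_matches` is a hypothetical helper, not something in flash-attn, and the multi-GPU branch only runs when at least two GPUs are visible.

```python
import torch


def current_device_matches(t: torch.Tensor) -> bool:
    """Return True if t is on the CPU or on the current CUDA device.

    Hypothetical diagnostic helper for illustration only.
    """
    if not t.is_cuda:
        return True
    return t.device.index == torch.cuda.current_device()


if torch.cuda.device_count() >= 2:
    # Data on GPU 1; the current device is whatever the process last selected.
    x = torch.empty(1, device="cuda:1")
    print("before guard:", current_device_matches(x))
    # Inside the guard, GPU 1 is the current device for new launches.
    with torch.cuda.device(x.device):
        print("inside guard:", current_device_matches(x))
```

If the first check reports a mismatch, code that launches on the current device rather than on the tensor's device would target the wrong GPU, which matches the symptom described in this thread.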

@beiw-nv
Author

beiw-nv commented Aug 1, 2025

I see. When do you expect we will have Blackwell support for _flash_attn_bwd?

@tridao
Member

tridao commented Aug 1, 2025

3-4 weeks


2 participants