Fix illegal memory access through off-by-one error in num_splits_dynamic_ptr init #1747

Open · wants to merge 1 commit into main
Conversation

@klondenberg-bioptimus klondenberg-bioptimus commented Jul 10, 2025

There is an off-by-one error in flash_api.cpp / set_params_fprop() that can lead to illegal memory accesses; it caused rare crashes in our codebase.

Error description

In set_params_fprop, if scheduler_needs_semaphore == false and use_dynamic_split == true, the tile_count_semaphore tensor is sized here to hold exactly batch-size elements, which would in principle be sufficient:

int metadata_size = int(scheduler_needs_semaphore) + int(use_dynamic_split) * params.b;

But on line 975, a bit further down, num_splits_dynamic_ptr is initialized from the raw data of the tile_count_semaphore tensor with an offset of 1, even when scheduler_needs_semaphore == false.

If num_splits_dynamic_ptr is then accessed at its supposedly last valid element, at index batch size - 1, the read lands one past the end of the allocation. Because it is only an off-by-one, the bug is rarely observable, but it led to (rare) crashes and numerical issues in our CI. It could be detected by running some of our tests under "compute-sanitizer --padding 128 ..." with PYTORCH_NO_CUDA_MEMORY_CACHING=1 set to disable PyTorch's caching allocator (without that, the out-of-bounds access usually still hit memory belonging to a valid allocation).

params.num_splits_dynamic_ptr = use_dynamic_split ? tile_count_semaphore.data_ptr<int>() + 1 : nullptr;
