Skip to content

Conversation

klondenberg-bioptimus
Copy link

@klondenberg-bioptimus klondenberg-bioptimus commented Jul 10, 2025

There is an off-by-one error in flash_api.cpp / set_params_fprop() which can lead to memory access violations in our codebase.

Error description

in set_params_fprop, if scheduler_needs_semaphore == False and use_dynamic_split == True, the size of the tile_count_semaphore tensor is initialized here to be equal to the batch size. Which is theoretically sufficient:

int metadata_size = int(scheduler_needs_semaphore) + int(use_dynamic_split) * params.b;

But on line 975 a bit further below, even if scheduler_needs_semaphore==False, there is an offset of 1 being used to initialize num_splits_dynamic_ptr based off the raw data of the tile_count_semaphore tensor.

If num_splits_dynamic_ptr is now again being accessed at it's supposedly last valid element at an index equal to the batch size - 1, an illegal memory access occurs. Since it's just an off-by-one error, this might rarely be detectable, but it led to (rare) crashes and numerical issues in our CI. It could be detected by running some of our tests with "compute-sanitizer --padding 128 ... " while setting PYTORCH_NO_CUDA_MEMORY_CACHING=1 to disable pytorch's caching allocator ( without that, the access usually still hit memory that belonged to a valid allocation even if it was out of bounds ).

params.num_splits_dynamic_ptr = use_dynamic_split ? tile_count_semaphore.data_ptr<int>() + 1 : nullptr;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant