Support paged attention for eagle overlap #12
Added support for paged attention by doing the following (a minimal sketch follows the list):

- Pages are allocated in `run_batch`. Since we do not know the fill status of the most recent page (it is still running on the GPU), we allocate for the worst-case number of pages, starting from a new page.
- Modified the `assign_draft_cache_locs` kernel in the draft decode to prepend the remaining unused cache locs from the previous page. We don't have to worry about freeing the excess here because the allocator state is restored after the draft.
- Modified the `merge_cache_loc` kernel in the verify to prepend the remaining unused cache locs from the previous page. We store the excess pages in an `evict_cache_loc` tensor, which is combined with the other pages that are evicted after accepting tokens.
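For illustration, here is a minimal pure-PyTorch sketch of the two ideas above: allocating the worst-case number of pages when the fill status of the last page is unknown, and laying out new cache locs so that the unused slots of the previous page come first. The function names (`worst_case_num_new_pages`, `prepend_prev_page_locs`) and shapes are hypothetical; the real change lives in the `assign_draft_cache_locs` and `merge_cache_loc` kernels.

```python
import torch


def worst_case_num_new_pages(num_new_tokens: int, page_size: int) -> int:
    # Hypothetical helper: with the fill status of the most recent page
    # unknown (the previous batch is still running on the GPU), assume the
    # new tokens start on a fresh page and allocate ceil(n / page_size) pages.
    return (num_new_tokens + page_size - 1) // page_size


def prepend_prev_page_locs(
    seq_len: int,
    last_loc: int,
    new_page_locs: torch.Tensor,
    num_new_tokens: int,
    page_size: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical per-request version of the cache-loc layout: reuse the
    # unused slots of the request's last, partially filled page first, then
    # fall back to slots from the freshly allocated pages. Whatever is left
    # over is the excess that the verify path would collect for eviction
    # (the `evict_cache_loc` tensor in this PR).
    used_in_last_page = seq_len % page_size
    num_leftover = (page_size - used_in_last_page) % page_size
    # Slots within a page are contiguous, so the free slots of the last page
    # directly follow the last written location.
    leftover_locs = torch.arange(last_loc + 1, last_loc + 1 + num_leftover)
    all_locs = torch.cat([leftover_locs, new_page_locs])
    return all_locs[:num_new_tokens], all_locs[num_new_tokens:]
```

A small usage example with made-up numbers:

```python
page_size = 4
seq_len, last_loc = 6, 5          # 6 tokens cached; the last one sits at loc 5
num_draft_tokens = 4

num_pages = worst_case_num_new_pages(num_draft_tokens, page_size)   # 1 page
new_page_locs = torch.arange(8, 8 + num_pages * page_size)          # pretend the allocator returned page [8..11]

locs, excess = prepend_prev_page_locs(
    seq_len, last_loc, new_page_locs, num_draft_tokens, page_size
)
# locs   -> [6, 7, 8, 9]   (the two free slots of the old page are used first)
# excess -> [10, 11]       (candidates for eviction after verification)
```

In the draft path the excess does not need to be freed explicitly because the allocator state is restored after the draft; in the verify path it is combined with the locations evicted for rejected tokens.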
TODO

Correctness has been achieved for all attention backends other than FA3.
The code is correct when FA3 is used for the draft decode and extend, but not for verify.