Cuda perf tuning #2307

Merged: awni merged 4 commits into ml-explore:main from cuda_perf_tuning on Jun 20, 2025

awni
Member

@awni awni commented Jun 20, 2025

A few minor fixes which also have a tiny impact on perf (~1-2 tok/sec) but are mostly the right thing to do.

constexpr int page_size = 16384;

Member Author

The default page size on Linux is 4096 bytes, which is quite small. I use the same value we use on macOS here, which results in better cache hits and reuse (since we round allocation sizes up to the page size).
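
To illustrate why a larger rounding granularity can improve reuse, here is a minimal sketch. It assumes a buffer cache that hands back freed buffers by their rounded size; the helper name `round_up` is hypothetical and is not taken from the PR's diff. Only the value 16384 comes from the change above.

```cpp
#include <cassert>
#include <cstddef>

// Value quoted in this PR. Everything else below is an illustrative assumption.
constexpr std::size_t page_size = 16384;

// Hypothetical helper: round a requested byte count up to the next
// multiple of `page`. A size that is already a multiple is unchanged.
constexpr std::size_t round_up(std::size_t nbytes, std::size_t page) {
  return (nbytes + page - 1) / page * page;
}

// With 4096-byte rounding, requests of 5000 and 12000 bytes land in
// different size buckets (8192 vs 12288), so a freed buffer from one
// cannot serve the other in a cache keyed by rounded size.
static_assert(round_up(5000, 4096) != round_up(12000, 4096));

// With 16384-byte rounding, both requests map to the same bucket,
// so the freed buffer can be reused.
static_assert(round_up(5000, page_size) == round_up(12000, page_size));
```

The trade-off is that small allocations waste more memory per buffer, in exchange for fewer distinct sizes in the cache.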

@awni awni force-pushed the cuda_perf_tuning branch from 2fe6563 to 1a0e884 Compare June 20, 2025 20:01
Member

@angeloskath angeloskath left a comment

👍

@awni awni merged commit c9a9180 into ml-explore:main Jun 20, 2025
6 checks passed
@awni awni deleted the cuda_perf_tuning branch June 24, 2025 15:51
2 participants