Cuda perf tuning #2307

awni · 2025-06-20T19:58:56Z

A few minor fixes which also have a tiny impact on perf (~1-2 tok/sec) but are mostly the right thing to do.

awni · 2025-06-20T19:59:54Z

mlx/backend/cuda/allocator.cpp

+constexpr int page_size = 16384;
+


The default page size on Linux is 4096 bytes, which is quite small. I use the same value we use on macOS here, which results in better cache hits and reuse (since we round up to the page size).
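As a rough illustration of why a larger rounding granularity can improve buffer reuse, here is a minimal sketch. It is not the actual MLX allocator code: `round_up` is a hypothetical helper, and the idea that cached buffers are matched by their rounded size is an assumption made for the example. Only the `page_size` value comes from the diff.

```cpp
#include <cstddef>

// Value taken from the diff in this PR.
constexpr std::size_t page_size = 16384;
// Typical default page size on Linux, as mentioned in the comment.
constexpr std::size_t linux_default_page_size = 4096;

// Hypothetical helper: round nbytes up to the next multiple of `page`.
constexpr std::size_t round_up(std::size_t nbytes, std::size_t page) {
  return ((nbytes + page - 1) / page) * page;
}

// With 4096-byte rounding, two nearby request sizes land in different buckets...
static_assert(round_up(5000, linux_default_page_size) == 8192, "5000 -> 8192");
static_assert(round_up(12000, linux_default_page_size) == 12288, "12000 -> 12288");

// ...while with 16384-byte rounding both map to the same size, so (assuming
// the cache matches buffers by rounded size) one freed buffer can serve the other.
static_assert(round_up(5000, page_size) == 16384, "5000 -> 16384");
static_assert(round_up(12000, page_size) == 16384, "12000 -> 16384");
```

The trade-off in this sketch is that small allocations waste more memory per buffer in exchange for fewer distinct sizes.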

angeloskath

👍

perf tuning

72e21b7

awni commented Jun 20, 2025

fix adding inputs arrays in matmul / sort

1a0e884

awni force-pushed the cuda_perf_tuning branch from 2fe6563 to 1a0e884 Compare June 20, 2025 20:01

awni added 2 commits June 20, 2025 13:01

format

6bb0b25

fix

de190bf

angeloskath approved these changes Jun 20, 2025

awni merged commit c9a9180 into ml-explore:main Jun 20, 2025
6 checks passed

awni deleted the cuda_perf_tuning branch June 24, 2025 15:51

BrewTestBot mentioned this pull request Jul 25, 2025

mlx 0.27.1 Homebrew/homebrew-core#231260

Merged

1 task
