[CUDA] Fix reductions #2314
Conversation
```cpp
out.set_data(allocator::malloc(out.nbytes()));
}

encoder.set_input_array(in);
```
I don't think you need to add the input here as the following kernel does not depend on it.
In fact maybe it makes sense to remove the input from the function signature entirely?
Unfortunately `ReduceInit` takes `T` not `U` for its type, so we need `x`, but yeah, it isn't an input to the kernel... silly copy-paste.

Btw I haven't decided if `ReduceInit` should simply take `U`; I can't think of a place where it makes sense to look at `T`.
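For context, here is a minimal standalone sketch of the shape under discussion (the names `SumOp`, `MaxOp`, and `reduce_init` are illustrative, not the actual mlx definitions; `T` stands in for the input element type and `U` for the accumulator type):

```cpp
#include <cuda/std/limits>
#include <cuda/std/type_traits>

// Illustrative sketch only. Keyed on the input type T, so picking the
// initial value requires knowing the input dtype -- which is why the input
// array x still has to be threaded through even though the kernel never
// reads its data.
struct SumOp {};
struct MaxOp {};

template <typename Op, typename T>
__device__ constexpr T reduce_init() {
  if constexpr (cuda::std::is_same_v<Op, MaxOp>) {
    return cuda::std::numeric_limits<T>::lowest();
  } else {
    return T{0};
  }
}
```

Keying the functor on `U` instead would let the call site drop `x` entirely, which is the alternative being weighed above.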
```diff
 };

 struct Max {
   template <typename T>
-  __device__ T operator()(T a, T b) {
+  __device__ __forceinline__ T operator()(T a, T b) {
```
What motivated the `__forceinline__`? Just curious if we should be adding it in our other functors.
Probably not. I didn't see any perf difference. CUB uses it, so I used it to see if it makes a difference 🤷‍♂️.
```cpp
int blocks, threads;
size_t block_step;
array x = in;

// Large array so allocate an intermediate and accumulate there
std::tie(blocks, threads, block_step) = get_args(x.size(), N_READS);
```
Suggested change:

```diff
-int blocks, threads;
-size_t block_step;
 array x = in;
 // Large array so allocate an intermediate and accumulate there
-std::tie(blocks, threads, block_step) = get_args(x.size(), N_READS);
+auto [blocks, threads, block_step] = get_args(x.size(), N_READS);
```
The thing is that capturing structured bindings is a C++26 feature. I also changed the one in `col_reduce` to remove the warnings.
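For context, a small host-side sketch of the trade-off (the `get_args` body here is a placeholder with the signature implied by the snippet above):

```cpp
#include <cstddef>
#include <tuple>

// Placeholder with the signature implied by the snippet above; the real
// logic lives in the PR.
std::tuple<int, int, size_t> get_args(size_t size, int n_reads) {
  return {1, 256, size / n_reads};
}

void launch(size_t size, int n_reads) {
  // A structured binding reads nicely, but capturing the bound names in a
  // lambda (e.g. one handed to a kernel launcher) is the portability issue
  // mentioned above:
  //
  //   auto [blocks, threads, block_step] = get_args(size, n_reads);
  //   auto body = [=] { /* capturing blocks/threads/block_step warns */ };
  //
  // Pre-declared variables assigned through std::tie sidestep it, since
  // ordinary variables can always be captured by value.
  int blocks, threads;
  size_t block_step;
  std::tie(blocks, threads, block_step) = get_args(size, n_reads);
  auto body = [=] { (void)blocks; (void)threads; (void)block_step; };
  body();
}
```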
> The thing is that capturing structured bindings is a C++26 feature

Ahh 😞 I forgot about that.
Looks awesome! Left a few minor comments.

Lots of green! More passing tests! 🚀
I am not too happy with this, but I think it is a start.

I attempted an (almost certainly premature) "optimization" that I think complicated the code a bit and is probably not worth it: when the input is transposed, instead of reading out of order and writing in order, I opted to transpose the output the same way so that we can both read and write in order.
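To illustrate the access-pattern motivation, here is a hypothetical CPU-side analogue (not the PR's kernel): summing each column of a row-major `M x N` matrix by walking in storage order keeps both reads and writes sequential, whereas walking down each column would read out of order.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only. Row-by-row traversal touches the input contiguously
// and accumulates into a small output that is also written in order.
std::vector<float> column_sums(const std::vector<float>& a, size_t M, size_t N) {
  std::vector<float> out(N, 0.0f);
  for (size_t i = 0; i < M; ++i) {
    for (size_t j = 0; j < N; ++j) {
      out[j] += a[i * N + j]; // contiguous reads and contiguous writes
    }
  }
  return out;
}
```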
Here are some speed comparisons with PT on `MxN` matrices of type bfloat16. It is obvious that our column reductions need work, and we need to make our performance more consistent across sizes as well.
There are more weird issues to clear up, like for instance the maximum complex number being `-inf - inf j`, which is weird, but there is also no real ordering for complex numbers, so 🤷‍♂️.
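For reference, a hedged sketch of how an init value like that can arise: applying the scalar max identity (`-inf`) component-wise to a complex type (illustrative, not the PR's code):

```cpp
#include <cuComplex.h>
#include <cuda/std/limits>

// Illustrative only: using the scalar max identity (-inf) for both the real
// and imaginary components yields the (-inf - inf j) value noted above.
__device__ cuFloatComplex complex_max_init() {
  float neg_inf = -cuda::std::numeric_limits<float>::infinity();
  return make_cuFloatComplex(neg_inf, neg_inf);
}
```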