SwiGLU kernel design choice. #3313
Unanswered
khalil-Hennara asked this question in Q&A
Replies: 0 comments
Hi,
I was looking through the kernels in the unsloth repo, specifically the SwiGLU kernel, and I noticed that in the backward pass you reshape the activation matrices `e` and `g` into 2D, while in the forward pass the matrices keep their original shape (batch_size, sequence_length, h_d). Since the data is accessed the same way in both kernels, my question is: is there a reason to reshape the matrices in the backward pass kernel, even though the forward pass works fine without reshaping?

My other question is about the backward pass: why do you recompute the result `h_row = f_row * g_row` and overwrite DW, the gradient buffer, with this activation value?
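For context, here is a hedged NumPy sketch of the elementwise math being discussed, not unsloth's actual Triton kernel (variable names `e`, `g`, `f`, `h` follow the question; the numerics are just the standard SwiGLU definition h = SiLU(e) * g):

```python
import numpy as np

def swiglu_forward(e, g):
    # SwiGLU: f = SiLU(e) = e * sigmoid(e), then h = f * g.
    # The op is purely elementwise, so a (batch, seq, h_d) tensor and a
    # flattened 2D view of it give identical results.
    se = 1.0 / (1.0 + np.exp(-e))
    f = e * se
    h = f * g
    return h, f

def swiglu_backward(dY, e, g):
    # Gradients of h = SiLU(e) * g w.r.t. e and g, given upstream grad dY.
    se = 1.0 / (1.0 + np.exp(-e))
    f = e * se                                   # recomputed, not stored
    dg = dY * f                                  # dh/dg = SiLU(e)
    de = dY * g * se * (1.0 + e * (1.0 - se))    # dh/de = SiLU'(e) * g
    # Recomputing h = f * g here (the "h_row = f_row * g_row" line in the
    # question) means the forward activation never has to be saved,
    # trading a little extra compute for less memory traffic.
    h = f * g
    return de, dg, h
```

This is only a sketch of the math, under the assumption that the recomputation exists to avoid materializing the forward activation.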
I hope to answer @danielhanchen