SwiGLU kernel design choice. #3313
Unanswered
khalil-Hennara asked this question in Q&A
Replies: 0 comments
Hi,
I was looking through the kernels in the unsloth repo, specifically the SwiGLU kernel, and I noticed that in the backward pass you reshape the activation matrices `e` and `g` into 2D, while in the forward pass the matrices keep their original shape (batch_size, sequence_length, h_d). Since the data is accessed the same way in both kernels, my question is: is there a reason to reshape the matrices in the backward pass kernel, even though the forward pass works fine without reshaping?

My other question is about the backward pass: why do you recompute the result `h_row = f_row * g_row` and overwrite DW, the gradient buffer, with this activation value?
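For context, here is a hedged NumPy sketch of the elementwise math being discussed, not unsloth's actual Triton kernel (variable names `e`, `g`, `f`, `h` follow the question; the numerics are just the standard SwiGLU definition h = SiLU(e) * g):

```python
import numpy as np

def swiglu_forward(e, g):
    # SwiGLU: f = SiLU(e) = e * sigmoid(e), then h = f * g.
    # The op is purely elementwise, so a (batch, seq, h_d) tensor and a
    # flattened 2D view of it give identical results.
    se = 1.0 / (1.0 + np.exp(-e))
    f = e * se
    h = f * g
    return h, f

def swiglu_backward(dY, e, g):
    # Gradients of h = SiLU(e) * g w.r.t. e and g, given upstream grad dY.
    se = 1.0 / (1.0 + np.exp(-e))
    f = e * se                                   # recomputed, not stored
    dg = dY * f                                  # dh/dg = SiLU(e)
    de = dY * g * se * (1.0 + e * (1.0 - se))    # dh/de = SiLU'(e) * g
    # Recomputing h = f * g here (the "h_row = f_row * g_row" line in the
    # question) means the forward activation never has to be saved,
    # trading a little extra compute for less memory traffic.
    h = f * g
    return de, dg, h
```

This is only a sketch of the math, under the assumption that the recomputation exists to avoid materializing the forward activation.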
I hope to answer @danielhanchen