"""dropout_p should be set to 0.0 during evaluation
111
113
Supports multi-query and grouped-query attention (MQA/GQA) by passing in KV with fewer heads
112
114
than Q. Note that the number of heads in Q must be divisible by the number of heads in KV.
@@ -128,6 +130,8 @@ Arguments:
128
130
alibi_slopes: (nheads,) or (batch_size, nheads), fp32. A bias of
129
131
(-alibi_slope * |i + seqlen_k - seqlen_q - j|)
130
132
is added to the attention score of query i and key j.
133
+
deterministic: bool. Whether to use the deterministic implementation of the backward pass,
134
+
which is slightly slower and uses more memory. The forward pass is always deterministic.
131
135
Return:
132
136
out: (batch_size, seqlen, nheads, headdim).
133
137
"""

Implement sliding window attention (i.e., local attention). Thanks to [Mistral AI](https://mistral.ai/) and in particular Timothée Lacroix for this contribution. Sliding window was used in the [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) model.
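
A short sketch of how local attention can be requested, assuming `flash_attn_func` takes a `window_size=(left, right)` tuple where `-1` means unbounded on that side; the window size below is illustrative:

```python
# Sketch: restrict each query to a local window of keys.
import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 4096, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 4096, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 4096, 8, 64, dtype=torch.float16, device="cuda")

# Causal sliding window: each query attends to itself and the
# previous 1023 positions only.
out = flash_attn_func(q, k, v, causal=True, window_size=(1023, 0))
```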

### 2.4: ALiBi (attention with linear bias), deterministic backward pass

Implement ALiBi (Press et al., 2021). Thanks to Sanghun Cho from Kakao Brain for this contribution.
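
The kernel consumes precomputed per-head slopes; a common choice, following Press et al. (2021), is a geometric sequence. A minimal sketch, assuming `flash_attn_func` and a power-of-two head count (shapes are illustrative):

```python
# Sketch: geometric ALiBi slopes as in Press et al. (2021), passed via
# the alibi_slopes argument documented above.
import torch
from flash_attn import flash_attn_func

nheads = 8
# slope_h = 2^(-8 * (h + 1) / nheads) for h = 0..nheads-1
slopes = torch.tensor(
    [2 ** (-8 * (h + 1) / nheads) for h in range(nheads)],
    dtype=torch.float32, device="cuda",
)

q = torch.randn(2, 512, nheads, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 512, nheads, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 512, nheads, 64, dtype=torch.float16, device="cuda")

# The kernel adds (-slope * |i + seqlen_k - seqlen_q - j|) to the
# attention score of query i and key j.
out = flash_attn_func(q, k, v, causal=True, alibi_slopes=slopes)
```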

Implement deterministic backward pass. Thanks to engineers from [Meituan](https://www.meituan.com) for this contribution.

## Performance

We present expected speedup (combined forward + backward pass) and memory savings from using FlashAttention against PyTorch standard attention, depending on sequence length, on different GPUs (speedup depends on memory bandwidth - we see more speedup on slower GPU memory).