
Commit a78d2ee

limin2021 authored and piotrekobi committed
[fix-doc-bug] Fix fused_attention_op english doc test=document_fix (PaddlePaddle#36803)
* Fix fused_attention english doc test=document_fix
1 parent d7741f3 commit a78d2ee

2 files changed: +33 -23 lines changed

python/paddle/incubate/nn/functional/fused_transformer.py

Lines changed: 24 additions & 18 deletions

@@ -194,24 +194,27 @@ def fused_multi_head_attention(x,
     Multi-Head Attention performs multiple parallel attention to jointly attending
     to information from different representation subspaces. This API only
     support self_attention. The pseudo code is as follows:
-    if pre_layer_norm:
-        out = layer_norm(x);
-        out = linear(out) + qkv)bias
-    else:
-        out = linear(x) + bias;
-    out = transpose(out, perm=[2, 0, 3, 1, 4]);
-    # extract q, k and v from out.
-    q = out[0:1,::]
-    k = out[1:2,::]
-    v = out[2:3,::]
-    out = q * k^t;
-    out = attn_mask + out;
-    out = softmax(out);
-    out = dropout(out);
-    out = out * v;
-    out = transpose(out, perm=[0, 2, 1, 3]);
-    out = out_linear(out);
-    out = layer_norm(x + dropout(linear_bias + out));
+
+    .. code-block:: python
+
+        if pre_layer_norm:
+            out = layer_norm(x)
+            out = linear(out) + qkv) + bias
+        else:
+            out = linear(x) + bias
+        out = transpose(out, perm=[2, 0, 3, 1, 4])
+        # extract q, k and v from out.
+        q = out[0:1,::]
+        k = out[1:2,::]
+        v = out[2:3,::]
+        out = q * k^t
+        out = attn_mask + out
+        out = softmax(out)
+        out = dropout(out)
+        out = out * v
+        out = transpose(out, perm=[0, 2, 1, 3])
+        out = out_linear(out)
+        out = layer_norm(x + dropout(linear_bias + out))
 
     Parameters:
         x (Tensor): The input tensor of fused_multi_head_attention. The shape is
@@ -245,6 +248,9 @@ def fused_multi_head_attention(x,
         ln_epsilon (float, optional): Small float value added to denominator of layer_norm
             to avoid dividing by zero. Default is 1e-5.
 
+    Returns:
+        Tensor: The output Tensor, the data type and shape is same as `x`.
+
     Examples:
 
         .. code-block:: python
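For orientation, the following is a rough, unfused sketch of what the pseudo code in this docstring computes along the pre_layer_norm=False path, written with plain paddle ops. The tensor shapes, variable names, and the 0.5 dropout probability are illustrative assumptions added for this page, not part of the commit; the 1/sqrt(head_dim) scale is the usual convention that the pseudo code leaves implicit.

import paddle
import paddle.nn.functional as F

# Illustrative sizes (assumptions): batch 2, sequence length 4, 4 heads of size 32 (embed_dim 128).
batch, seq_len, num_heads, head_dim = 2, 4, 4, 32
embed_dim = num_heads * head_dim

x = paddle.rand([batch, seq_len, embed_dim])
qkv_weight = paddle.rand([3, num_heads, head_dim, embed_dim])
qkv_bias = paddle.rand([3, num_heads, head_dim])
linear_weight = paddle.rand([embed_dim, embed_dim])
linear_bias = paddle.rand([embed_dim])
attn_mask = paddle.zeros([batch, num_heads, seq_len, seq_len])  # additive mask, 0 = attend everywhere

# out = linear(x) + bias  (the pre_layer_norm=False branch of the pseudo code)
out = paddle.matmul(x, qkv_weight.reshape([3 * num_heads * head_dim, embed_dim]), transpose_y=True)
out = out.reshape([batch, seq_len, 3, num_heads, head_dim]) + qkv_bias
# out = transpose(out, perm=[2, 0, 3, 1, 4]); extract q, k and v from out.
out = out.transpose([2, 0, 3, 1, 4])
q, k, v = out[0], out[1], out[2]

# out = q * k^t; out = attn_mask + out; out = softmax(out); out = dropout(out)
# (the 1/sqrt(head_dim) scaling is the standard convention; the pseudo code omits it)
scores = paddle.matmul(q, k, transpose_y=True) / (head_dim ** 0.5) + attn_mask
weights = F.dropout(F.softmax(scores, axis=-1), p=0.5)

# out = out * v; transpose back to [batch, seq_len, embed_dim]; out = out_linear(out)
out = paddle.matmul(weights, v).transpose([0, 2, 1, 3]).reshape([batch, seq_len, embed_dim])
out = paddle.matmul(out, linear_weight)

# out = layer_norm(x + dropout(linear_bias + out))
out = F.layer_norm(x + F.dropout(out + linear_bias, p=0.5), normalized_shape=[embed_dim])
print(out.shape)  # [2, 4, 128], same shape as x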

python/paddle/incubate/nn/layer/fused_transformer.py

Lines changed: 9 additions & 5 deletions

@@ -24,11 +24,12 @@
 
 class FusedMultiHeadAttention(Layer):
     """
-    Attention mapps queries and a set of key-value pairs to outputs, and
+    Attention mapps queries and a set of key-value pairs to outputs, and
     Multi-Head Attention performs multiple parallel attention to jointly attending
     to information from different representation subspaces.
     Please refer to `Attention Is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`_
     for more details.
+
     Parameters:
         embed_dim (int): The expected feature size in the input and output.
         num_heads (int): The number of heads in multi-head attention.
@@ -42,17 +43,18 @@ class FusedMultiHeadAttention(Layer):
             `embed_dim`. Default None.
         vdim (int, optional): The feature size in value. If None, assumed equal to
             `embed_dim`. Default None.
-        normalize_before (bool, optional): Indicate whether it is pre_layer_norm (True)
-            or post_layer_norm architecture (False). Default False.
+        normalize_before (bool, optional): Indicate whether it is pre_layer_norm
+            (True) or post_layer_norm architecture (False). Default False.
         need_weights (bool, optional): Indicate whether to return the attention
            weights. Now, only False is supported. Default False.
        weight_attr(ParamAttr, optional): To specify the weight parameter property.
            Default: None, which means the default weight parameter property is used.
-            See usage for details in :code:`ParamAttr` .
+            See usage for details in :code:`ParamAttr`.
        bias_attr (ParamAttr|bool, optional): To specify the bias parameter property.
            Default: None, which means the default bias parameter property is used.
            If it is set to False, this layer will not have trainable bias parameter.
-            See usage for details in :code:`ParamAttr` .
+            See usage for details in :code:`ParamAttr`.
+
    Examples:
 
        .. code-block:: python
@@ -139,6 +141,7 @@ def forward(self, query, key=None, value=None, attn_mask=None, cache=None):
        """
        Applies multi-head attention to map queries and a set of key-value pairs
        to outputs.
+
        Parameters:
            query (Tensor): The queries for multi-head attention. It is a
                tensor with shape `[batch_size, query_length, embed_dim]`. The
@@ -163,6 +166,7 @@ def forward(self, query, key=None, value=None, attn_mask=None, cache=None):
                nothing wanted or needed to be prevented attention to. Default None.
            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
                Now, only None is supported. Default None.
+
        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `query`, representing attention output.
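For context, a minimal usage sketch of the layer whose docstring this hunk touches. The import path, shapes, and argument values are assumptions based on the surrounding docstring, and the fused kernel generally expects a GPU build of Paddle.

import paddle
from paddle.incubate.nn import FusedMultiHeadAttention

# query: [batch_size, query_length, embed_dim]
query = paddle.rand([2, 4, 128])
# additive self-attention mask: [batch_size, num_heads, query_length, query_length]
attn_mask = paddle.zeros([2, 2, 4, 4])

# embed_dim=128 split across num_heads=2; self attention, so key/value default to query
multi_head_attn = FusedMultiHeadAttention(embed_dim=128, num_heads=2)
out = multi_head_attn(query, attn_mask=attn_mask)
print(out.shape)  # [2, 4, 128], same shape and dtype as query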
