class FusedMultiHeadAttention(Layer):
    """
    Attention maps queries and a set of key-value pairs to outputs, and
    Multi-Head Attention performs multiple attention functions in parallel,
    jointly attending to information from different representation subspaces.
    Please refer to `Attention Is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`_
    for more details.

    Parameters:
        embed_dim (int): The expected feature size in the input and output.
        num_heads (int): The number of heads in multi-head attention.
        ...
        kdim (int, optional): The feature size in key. If None, assumed equal to
            `embed_dim`. Default None.
        vdim (int, optional): The feature size in value. If None, assumed equal to
            `embed_dim`. Default None.
        normalize_before (bool, optional): Indicates whether it is a pre_layer_norm
            (True) or post_layer_norm (False) architecture. Default False.
        need_weights (bool, optional): Indicates whether to return the attention
            weights. Currently, only False is supported. Default False.
        weight_attr (ParamAttr, optional): To specify the weight parameter property.
            Default: None, which means the default weight parameter property is used.
            See usage for details in :code:`ParamAttr`.
        bias_attr (ParamAttr|bool, optional): To specify the bias parameter property.
            Default: None, which means the default bias parameter property is used.
            If it is set to False, this layer will not have a trainable bias parameter.
            See usage for details in :code:`ParamAttr`.

    Examples:

        .. code-block:: python

            ...
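The commit view elides the body of the Examples block above. As a stand-in, the following is a minimal usage sketch consistent with the parameters documented in this class; the shapes, the all-zeros mask, and the import path paddle.incubate.nn.FusedMultiHeadAttention are illustrative assumptions rather than the original example body.

import paddle
from paddle.incubate.nn import FusedMultiHeadAttention

# Queries: [batch_size, query_length, embed_dim] = [2, 4, 128].
query = paddle.rand((2, 4, 128))
# Float mask, assumed broadcastable over
# [batch_size, num_heads, query_length, key_length]; all zeros here,
# i.e. no position is prevented from being attended to.
attn_mask = paddle.zeros((2, 2, 4, 4))

# normalize_before=False selects the post_layer_norm architecture
# described in Parameters above.
multi_head_attn = FusedMultiHeadAttention(embed_dim=128, num_heads=2,
                                          normalize_before=False)
out = multi_head_attn(query, attn_mask=attn_mask)  # shape: [2, 4, 128]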
    ...

    def forward(self, query, key=None, value=None, attn_mask=None, cache=None):
        """
        Applies multi-head attention to map queries and a set of key-value pairs
        to outputs.

        Parameters:
            query (Tensor): The queries for multi-head attention. It is a
                tensor with shape `[batch_size, query_length, embed_dim]`. The
                ...
                It can be None when nothing needs to be prevented from being
                attended to. Default None.
            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
                Currently, only None is supported. Default None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `query`, representing attention output.
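The attn_mask accepted by forward can prevent attention to unwanted positions. As an assumption (the defining docstring lines are elided above), a float mask follows Paddle's MultiHeadAttention convention of being added to the attention weights, with -INF disabling a position and 0 leaving it attendable. Under that assumption, a causal mask limiting each query position to its prefix could be built as follows; the broadcast layout [batch_size, num_heads, query_length, key_length] is likewise an assumption.

import paddle

seq_len = 4
# Lower-triangular "allowed" pattern: position i may attend to j <= i.
allowed = paddle.tril(paddle.ones((seq_len, seq_len)))
# 0.0 where attention is allowed, -INF where it must be prevented.
attn_mask = paddle.where(
    allowed.astype("bool"),
    paddle.zeros((seq_len, seq_len)),
    paddle.full((seq_len, seq_len), float("-inf")),
)
# Reshape so it broadcasts over [batch_size, num_heads, seq_len, seq_len].
attn_mask = attn_mask.reshape((1, 1, seq_len, seq_len))

Passed as forward(query, attn_mask=attn_mask), such a mask restricts each position to attend only to itself and earlier positions.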