FusedMultiHeadAttention¶
- class paddle.incubate.nn.FusedMultiHeadAttention(embed_dim, num_heads, dropout_rate=0.5, attn_dropout_rate=0.5, kdim=None, vdim=None, normalize_before=False, need_weights=False, weight_attr=None, bias_attr=None, epsilon=1e-05, name=None) [source]
-
Attention maps queries and a set of key-value pairs to outputs, and Multi-Head Attention performs multiple attention computations in parallel to jointly attend to information from different representation subspaces. Please refer to Attention Is All You Need for more details.
- Parameters
-
embed_dim (int) – The expected feature size in the input and output.
num_heads (int) – The number of heads in multi-head attention.
dropout_rate (float, optional) – The dropout probability used in the dropout applied after attention, i.e. on the attention output. 0 for no dropout. Default 0.5.
attn_dropout_rate (float, optional) – The dropout probability used on the attention weights inside attention, to drop some attention targets. 0 for no dropout. Default 0.5.
kdim (int, optional) – The feature size in key. If None, assumed equal to embed_dim. Default None.
vdim (int, optional) – The feature size in value. If None, assumed equal to embed_dim. Default None.
normalize_before (bool, optional) – Indicate whether it is pre_layer_norm (True) or post_layer_norm architecture (False). Default False.
need_weights (bool, optional) – Indicate whether to return the attention weights. Currently, only False is supported. Default False.
weight_attr (ParamAttr, optional) – To specify the weight parameter property. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr.
bias_attr (ParamAttr|bool, optional) – To specify the bias parameter property. Default: None, which means the default bias parameter property is used. If it is set to False, this layer will not have a trainable bias parameter. See usage for details in ParamAttr.
epsilon (float, optional) – The small value added to the variance to prevent division by zero. Default: 1e-05.
Examples
# required: gpu
import paddle

# input: [batch_size, sequence_length, embed_dim]
query = paddle.rand((2, 4, 128))
# self attention mask: [batch_size, num_heads, query_len, query_len]
attn_mask = paddle.rand((2, 2, 4, 4))
multi_head_attn = paddle.incubate.nn.FusedMultiHeadAttention(128, 2)
output = multi_head_attn(query, None, None, attn_mask=attn_mask)  # [2, 4, 128]
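A variant sketch (not part of the original example): the same module constructed with the pre-layer-norm architecture and dropout disabled. Only constructor arguments change; the call signature stays the same.

# required: gpu
import paddle

query = paddle.rand((2, 4, 128))
pre_ln_attn = paddle.incubate.nn.FusedMultiHeadAttention(
    embed_dim=128,
    num_heads=2,
    dropout_rate=0.0,       # no dropout after attention
    attn_dropout_rate=0.0,  # no dropout on attention weights
    normalize_before=True,  # pre_layer_norm architecture
)
output = pre_ln_attn(query)  # key/value default to query (self-attention) -> [2, 4, 128]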
-
forward(query, key=None, value=None, attn_mask=None, cache=None)¶
-
Applies multi-head attention to map queries and a set of key-value pairs to outputs.
- Parameters
-
query (Tensor) – The queries for multi-head attention. It is a tensor with shape [batch_size, query_length, embed_dim]. The data type should be float32 or float64.
key (Tensor, optional) – The keys for multi-head attention. It is a tensor with shape [batch_size, key_length, kdim]. The data type should be float32 or float64. If None, use query as key. Default None.
value (Tensor, optional) – The values for multi-head attention. It is a tensor with shape [batch_size, value_length, vdim]. The data type should be float32 or float64. If None, use query as value. Default None.
attn_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor whose shape can be broadcast to [batch_size, n_head, sequence_length, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when no positions need to be masked out. Default None. A hedged sketch of building such a mask appears after the Returns section below.
cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional) – Currently, only None is supported. Default None.
- Returns
-
It is a tensor that has the same shape and data type as query, representing attention output.
- Return type
-
Tensor|tuple
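Below is a minimal, hedged sketch of constructing a float attention mask following the convention above (0 at allowed positions, -INF at subsequent positions) and passing it to forward. It assumes a GPU build of Paddle, and the tensor names are illustrative only.

# required: gpu
import paddle

batch_size, num_heads, seq_len, embed_dim = 2, 2, 4, 128
query = paddle.rand((batch_size, seq_len, embed_dim))

# Float causal mask: 0 where attention is allowed, -INF at subsequent positions.
causal_mask = paddle.triu(
    paddle.full((seq_len, seq_len), float('-inf')), diagonal=1
)
# Reshape so it broadcasts to [batch_size, num_heads, seq_len, seq_len].
causal_mask = causal_mask.reshape((1, 1, seq_len, seq_len))

fused_attn = paddle.incubate.nn.FusedMultiHeadAttention(embed_dim, num_heads)
output = fused_attn(query, attn_mask=causal_mask)  # [2, 4, 128]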
-
extra_repr()¶
-
Extra representation of this layer. You can provide a custom implementation in your own layer.
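A minimal, hedged sketch of how a custom layer might override extra_repr; the subclass and attribute names here are hypothetical and not part of the Paddle API. The returned string appears in the layer's printed representation.

# required: gpu
import paddle

class MyFusedAttention(paddle.incubate.nn.FusedMultiHeadAttention):
    # Hypothetical subclass for illustration only.
    def __init__(self, embed_dim, num_heads, **kwargs):
        super().__init__(embed_dim, num_heads, **kwargs)
        self._extra = 'embed_dim={}, num_heads={}'.format(embed_dim, num_heads)

    def extra_repr(self):
        # Extra information to include when the layer is printed.
        return self._extra

layer = MyFusedAttention(128, 2)
print(layer)  # the extra_repr string is shown in the printout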