FusedMultiHeadAttention¶
- class paddle.incubate.nn.FusedMultiHeadAttention ( embed_dim, num_heads, dropout_rate=0.5, attn_dropout_rate=0.5, kdim=None, vdim=None, normalize_before=False, need_weights=False, qkv_weight_attr=None, qkv_bias_attr=None, linear_weight_attr=None, linear_bias_attr=None, pre_ln_scale_attr=None, pre_ln_bias_attr=None, ln_scale_attr=None, ln_bias_attr=None, epsilon=1e-05, nranks=1, ring_id=-1, transpose_qkv_wb=False, name=None ) [source]
-
Attention maps a query and a set of key-value pairs to an output, and multi-head attention runs multiple attention computations in parallel to jointly attend to information from different representation subspaces. Please refer to Attention Is All You Need for more details.
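Conceptually, this layer fuses (pre or post) layer normalization, the QKV projection, scaled dot-product attention, the output linear projection, dropout, and the residual addition into a single op. The following minimal sketch shows only the standard scaled dot-product multi-head attention step, written with unfused Paddle ops; it is illustrative and is not the fused kernel's actual implementation:

>>> import paddle
>>> import paddle.nn.functional as F
>>> batch, seq_len, embed_dim, num_heads = 2, 4, 128, 2
>>> head_dim = embed_dim // num_heads
>>> x = paddle.rand((batch, seq_len, embed_dim))
>>> def split_heads(t):
...     # [batch, seq, embed] -> [batch, heads, seq, head_dim]
...     return t.reshape([batch, seq_len, num_heads, head_dim]).transpose([0, 2, 1, 3])
>>> q, k, v = split_heads(x), split_heads(x), split_heads(x)  # self-attention
>>> scores = paddle.matmul(q, k, transpose_y=True) / head_dim ** 0.5
>>> weights = F.softmax(scores, axis=-1)  # attn_dropout_rate would apply here
>>> out = paddle.matmul(weights, v)  # [batch, heads, seq, head_dim]
>>> out = out.transpose([0, 2, 1, 3]).reshape([batch, seq_len, embed_dim])
>>> print(out.shape)
[2, 4, 128]

With normalize_before=True, layer normalization is applied to the input before this computation; with the default False, it is applied after the residual addition. dropout_rate applies to the projected output, attn_dropout_rate to the attention weights.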
- Parameters
-
embed_dim (int) – The expected feature size in the input and output.
num_heads (int) – The number of heads in multi-head attention.
dropout_rate (float, optional) – The dropout probability used after attention, applied to the attention output before the residual addition. 0 for no dropout. Default 0.5.
attn_dropout_rate (float, optional) – The dropout probability used inside attention, applied to the attention weights to drop some attention targets. 0 for no dropout. Default 0.5.
kdim (int, optional) – The feature size in key. If None, assumed equal to embed_dim. Default None.
vdim (int, optional) – The feature size in value. If None, assumed equal to embed_dim. Default None.
normalize_before (bool, optional) – Indicates whether the architecture is pre_layer_norm (True) or post_layer_norm (False). Default False.
need_weights (bool, optional) – Indicates whether to return the attention weights. Currently, only False is supported. Default False.
qkv_weight_attr (ParamAttr, optional) – To specify the weight parameter property for the QKV projection computation. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr.
qkv_bias_attr (ParamAttr|bool, optional) – To specify the bias parameter property for the QKV projection computation. False means the corresponding layer has no trainable bias parameter. Default: None, which means the default bias parameter property is used. See usage for details in ParamAttr.
linear_weight_attr (ParamAttr, optional) – To specify the weight parameter property for the linear projection computation. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr.
linear_bias_attr (ParamAttr|bool, optional) – To specify the bias parameter property for the linear projection computation. False means the corresponding layer has no trainable bias parameter. Default: None, which means the default bias parameter property is used. See usage for details in ParamAttr.
pre_ln_scale_attr (ParamAttr, optional) – To specify the weight parameter property for the pre_layer_norm computation. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr.
pre_ln_bias_attr (ParamAttr|bool, optional) – To specify the bias parameter property for the pre_layer_norm computation. False means the corresponding layer has no trainable bias parameter. Default: None, which means the default bias parameter property is used. See usage for details in ParamAttr.
ln_scale_attr (ParamAttr, optional) – To specify the weight parameter property for the post_layer_norm computation. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr.
ln_bias_attr (ParamAttr|bool, optional) – To specify the bias parameter property for the post_layer_norm computation. False means the corresponding layer has no trainable bias parameter. Default: None, which means the default bias parameter property is used. See usage for details in ParamAttr.
epsilon (float, optional) – The small value added to the variance to prevent division by zero. Default: 1e-05.
nranks (int, optional) – The number of ranks for distributed tensor model parallelism. Default is 1, which means tensor parallelism is not used.
ring_id (int, optional) – The ring id for distributed tensor model parallelism. Default is -1, which means tensor parallelism is not used.
transpose_qkv_wb (bool, optional) – Whether to accept the QKV matmul weight in shape [hidden_size, 3 * hidden_size] and the QKV matmul bias in shape [3 * hidden_size]. When True, the fused_attention op internally transposes the weight to [3, num_heads, head_dim, hidden_size] and the bias to [3, num_heads, head_dim]. Only supported on GPU for now. Default False, which means no transpose is applied to qkv_w and qkv_b. See the sketch after the Examples section.
name (str, optional) – For details, please refer to Name. Generally, no setting is required. Default: None.
Examples
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> # input: [batch_size, sequence_length, embed_dim]
>>> query = paddle.rand((2, 4, 128))
>>> # self attention mask: [batch_size, num_heads, query_len, query_len]
>>> attn_mask = paddle.rand((2, 2, 4, 4))
>>> multi_head_attn = paddle.incubate.nn.FusedMultiHeadAttention(128, 2)
>>> output = multi_head_attn(query, None, None, attn_mask=attn_mask)
>>> print(output.shape)
[2, 4, 128]
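The transpose_qkv_wb weight layout can be sketched as follows. This is an illustrative example, assuming a GPU device is available (the option is GPU-only); the commented shapes restate the parameter description above:

>>> import paddle
>>> paddle.device.set_device('gpu')
>>> embed_dim, num_heads = 128, 2
>>> # With transpose_qkv_wb=True, the QKV weight is held as
>>> # [embed_dim, 3 * embed_dim] and the bias as [3 * embed_dim];
>>> # the fused op transposes them internally.
>>> fused_attn = paddle.incubate.nn.FusedMultiHeadAttention(
...     embed_dim, num_heads, transpose_qkv_wb=True)
>>> x = paddle.rand((2, 4, embed_dim))
>>> out = fused_attn(x, attn_mask=None)
>>> print(out.shape)
[2, 4, 128]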
-
forward ( query, key=None, value=None, attn_mask=None, cache=None )¶
-
Applies multi-head attention to map queries and a set of key-value pairs to outputs.
- Parameters
-
query (Tensor) – The queries for multi-head attention. It is a tensor with shape [batch_size, query_length, embed_dim]. The data type should be float32 or float64.
key (Tensor, optional) – The keys for multi-head attention. It is a tensor with shape [batch_size, key_length, kdim]. The data type should be float32 or float64. If None, use query as key. Default None.
value (Tensor, optional) – The values for multi-head attention. It is a tensor with shape [batch_size, value_length, vdim]. The data type should be float32 or float64. If None, use query as value. Default None.
attn_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. Its shape must be broadcastable to [batch_size, n_head, sequence_length, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Default None. A mask-construction sketch follows the return description below.
cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional) – Currently, only None is supported. Default None.
- Returns
-
A tensor with the same shape and data type as query, representing the attention output.
- Return type
-
Tensor|tuple
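As a sketch of the mask semantics above: a boolean padding mask with False at unwanted positions, in a shape that broadcasts to [batch_size, n_head, sequence_length, sequence_length], plus its float equivalent. The names and padding pattern are illustrative:

>>> import paddle
>>> batch_size, seq_len = 2, 4
>>> # True = attend, False = masked out (padding at the tail here).
>>> valid = paddle.to_tensor([[True, True, True, False],
...                           [True, True, False, False]])
>>> bool_mask = valid.reshape([batch_size, 1, 1, seq_len])  # broadcasts over heads and query positions
>>> # Equivalent float mask: 0 where attended, -INF where masked.
>>> float_mask = paddle.where(bool_mask,
...                           paddle.zeros([batch_size, 1, 1, seq_len]),
...                           paddle.full([batch_size, 1, 1, seq_len], float('-inf')))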
-
extra_repr ( )¶
-
Extra representation of this layer; you can provide a custom implementation in your own layer.