TransformerDecoderLayer
class paddle.nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout=0.1, activation='relu', attn_dropout=None, act_dropout=None, normalize_before=False, weight_attr=None, bias_attr=None, layer_norm_eps=1e-05) [source]
TransformerDecoderLayer is composed of three sub-layers: decoder self (multi-head) attention, decoder-encoder cross attention, and a feedforward network. Before and after each sub-layer, pre-processing and post-processing are applied to the input and output accordingly. If normalize_before is True, pre-processing is layer normalization and post-processing includes dropout and residual connection. Otherwise, there is no pre-processing, and post-processing includes dropout, residual connection, and layer normalization.
Parameters
d_model (int) – The expected feature size in the input and output.
nhead (int) – The number of heads in multi-head attention (MHA).
dim_feedforward (int) – The hidden layer size in the feedforward network (FFN).
dropout (float, optional) – The dropout probability used in pre-processing and post-processing of the MHA and FFN sub-layers. Default 0.1
activation (str, optional) – The activation function in the feedforward network. Default relu.
attn_dropout (float, optional) – The dropout probability used in MHA to drop some attention targets. If None, use the value of dropout. Default None
act_dropout (float, optional) – The dropout probability used after FFN activation. If None, use the value of dropout. Default None
normalize_before (bool, optional) – Indicates whether to put layer normalization into the pre-processing of the MHA and FFN sub-layers. If True, pre-processing is layer normalization and post-processing includes dropout and residual connection. Otherwise, there is no pre-processing, and post-processing includes dropout, residual connection, and layer normalization. Default False
weight_attr (ParamAttr|list|tuple, optional) – To specify the weight parameter property. If it is a list/tuple, weight_attr[0] would be used as weight_attr for self attention, weight_attr[1] would be used as weight_attr for cross attention, and weight_attr[2] would be used as weight_attr for the linear layers in FFN. Otherwise, the three sub-layers all use it as weight_attr to create parameters. Default: None, which means the default weight parameter property is used. See usage for details in ParamAttr. A sketch of the list form appears after the Examples below.
bias_attr (ParamAttr|list|tuple|bool, optional) – To specify the bias parameter property. If it is a list/tuple, bias_attr[0] would be used as bias_attr for self attention, bias_attr[1] would be used as bias_attr for cross attention, and bias_attr[2] would be used as bias_attr for the linear layers in FFN. Otherwise, the three sub-layers all use it as bias_attr to create parameters. The False value means the corresponding layer would not have a trainable bias parameter. See usage for details in ParamAttr. Default: None, which means the default bias parameter property is used.
layer_norm_eps (float, optional) – The eps value in layer normalization components. Default 1e-5.
Examples
>>> import paddle
>>> from paddle.nn import TransformerDecoderLayer
>>> # decoder input: [batch_size, tgt_len, d_model]
>>> dec_input = paddle.rand((2, 4, 128))
>>> # encoder output: [batch_size, src_len, d_model]
>>> enc_output = paddle.rand((2, 6, 128))
>>> # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
>>> self_attn_mask = paddle.rand((2, 2, 4, 4))
>>> # cross attention mask: [batch_size, n_head, tgt_len, src_len]
>>> cross_attn_mask = paddle.rand((2, 2, 4, 6))
>>> decoder_layer = TransformerDecoderLayer(128, 2, 512)
>>> output = decoder_layer(dec_input,
...                        enc_output,
...                        self_attn_mask,
...                        cross_attn_mask)
>>> print(output.shape)
[2, 4, 128]
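As a further illustration, here is a minimal sketch (not part of the original documentation; the chosen initializers are arbitrary) of the list form of weight_attr and bias_attr described above: index 0 configures self attention, index 1 cross attention, and index 2 the FFN linear layers, and a False entry in bias_attrs disables the corresponding bias.

>>> import paddle
>>> from paddle.nn import TransformerDecoderLayer
>>> weight_attrs = [
...     paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()),
...     paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()),
...     paddle.ParamAttr(initializer=paddle.nn.initializer.KaimingUniform())]
>>> bias_attrs = [None, None, False]  # FFN linears get no trainable bias
>>> layer = TransformerDecoderLayer(128, 2, 512,
...                                 weight_attr=weight_attrs,
...                                 bias_attr=bias_attrs)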
forward(tgt, memory, tgt_mask=None, memory_mask=None, cache=None)
Applies a Transformer decoder layer on the input.
Parameters
tgt (Tensor) – The input of Transformer decoder layer. It is a tensor with shape [batch_size, target_length, d_model]. The data type should be float32 or float64.
memory (Tensor) – The output of Transformer encoder. It is a tensor with shape [batch_size, source_length, d_model]. The data type should be float32 or float64.
tgt_mask (Tensor, optional) – A tensor used in self attention to prevent attention to some unwanted positions, usually the subsequent positions. It is a tensor with shape broadcasted to [batch_size, n_head, target_length, target_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Default None. A sketch of constructing a float causal mask appears after this method's description.
memory_mask (Tensor, optional) – A tensor used in decoder-encoder cross attention to prevent attention to some unwanted positions, usually the paddings. It is a tensor with shape broadcasted to [batch_size, n_head, target_length, source_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Default None.
cache (tuple, optional) – It is a tuple (incremental_cache, static_cache). incremental_cache is an instance of MultiHeadAttention.Cache, and static_cache is an instance of MultiHeadAttention.StaticCache. See TransformerDecoderLayer.gen_cache for more details. It is only used for inference and should be None for training. Default None.
Returns
It is a tensor that has the same shape and data type as tgt, representing the output of the Transformer decoder layer. Or it is a tuple if cache is not None: besides the decoder layer output, the tuple includes the new cache, which is the same as the input cache argument except that the incremental_cache in it has an incremental length. See MultiHeadAttention.gen_cache and MultiHeadAttention.forward for more details.
Return type
Tensor|tuple
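As a minimal sketch (not part of the original example), a float-typed causal mask for tgt_mask can be built by placing -INF strictly above the diagonal and 0 elsewhere; the [target_length, target_length] tensor broadcasts to [batch_size, n_head, target_length, target_length]:

>>> import paddle
>>> tgt_len = 4
>>> # positions after the current one are blocked with -inf
>>> causal_mask = paddle.triu(
...     paddle.full((tgt_len, tgt_len), float('-inf')), diagonal=1)
>>> print(causal_mask.shape)
[4, 4]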
gen_cache(memory)
Generates cache for forward usage. The generated cache is a tuple composed of an instance of MultiHeadAttention.Cache and an instance of MultiHeadAttention.StaticCache.
Parameters
memory (Tensor) – The output of Transformer encoder. It is a tensor with shape [batch_size, source_length, d_model]. The data type should be float32 or float64.
Returns
It is a tuple (incremental_cache, static_cache). incremental_cache is an instance of MultiHeadAttention.Cache produced by self_attn.gen_cache(memory, MultiHeadAttention.Cache); it reserves two tensors shaped [batch_size, nhead, 0, d_model // nhead]. static_cache is an instance of MultiHeadAttention.StaticCache produced by cross_attn.gen_cache(memory, MultiHeadAttention.StaticCache); it reserves two tensors shaped [batch_size, nhead, source_length, d_model // nhead]. See MultiHeadAttention.gen_cache and MultiHeadAttention.forward for more details.
Return type
tuple
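To show how gen_cache pairs with forward for incremental decoding, here is a minimal sketch (not from the original docs): the cache is generated once from the encoder output, then forward is fed one target position per step and, since cache is not None, returns an (output, new_cache) tuple.

>>> import paddle
>>> from paddle.nn import TransformerDecoderLayer
>>> decoder_layer = TransformerDecoderLayer(128, 2, 512)
>>> decoder_layer.eval()  # caches are for inference only
>>> enc_output = paddle.rand((2, 6, 128))
>>> cache = decoder_layer.gen_cache(enc_output)
>>> tgt_step = paddle.rand((2, 1, 128))  # one target position per step
>>> output, cache = decoder_layer(tgt_step, enc_output, None, None, cache)
>>> print(output.shape)
[2, 1, 128]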