TransformerDecoder

class paddle.nn.TransformerDecoder(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Optional[LayerNorm] = None)

TransformerDecoder is a stack of N decoder layers.

Parameters

decoder_layer (Layer) – An instance of TransformerDecoderLayer. It is used as the first layer, and the remaining layers are created from its configuration.
num_layers (int) – The number of decoder layers to be stacked.
norm (LayerNorm|None, optional) – The layer normalization component. If provided, layer normalization is applied to the output of the last decoder layer.

Examples

>>> import paddle
>>> from paddle.nn import TransformerDecoderLayer, TransformerDecoder

>>> # decoder input: [batch_size, tgt_len, d_model]
>>> dec_input = paddle.rand((2, 4, 128))
>>> # encoder output: [batch_size, src_len, d_model]
>>> enc_output = paddle.rand((2, 6, 128))
>>> # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
>>> self_attn_mask = paddle.rand((2, 2, 4, 4))
>>> # cross attention mask: [batch_size, n_head, tgt_len, src_len]
>>> cross_attn_mask = paddle.rand((2, 2, 4, 6))
>>> decoder_layer = TransformerDecoderLayer(128, 2, 512)
>>> decoder = TransformerDecoder(decoder_layer, 2)
>>> output = decoder(dec_input,
...                  enc_output,
...                  self_attn_mask,
...                  cross_attn_mask)
>>> print(output.shape)
[2, 4, 128]

forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, cache: None = None) → Tensor

forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, cache: Sequence[tuple[MultiHeadAttention.Cache, MultiHeadAttention.StaticCache]] = None) → tuple[Tensor, list[tuple[MultiHeadAttention.Cache, MultiHeadAttention.StaticCache]]]

Applies a stack of N Transformer decoder layers to the inputs. If norm is provided, layer normalization is also applied to the output of the last decoder layer.

Parameters

tgt (Tensor) – The input of Transformer decoder. It is a tensor with shape [batch_size, target_length, d_model]. The data type should be float32 or float64.
memory (Tensor) – The output of Transformer encoder. It is a tensor with shape [batch_size, source_length, d_model]. The data type should be float32 or float64.
tgt_mask (Tensor|None, optional) – A tensor used in self attention to prevent attention to unwanted positions, usually the subsequent positions. Its shape must be broadcastable to [batch_size, n_head, target_length, target_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when no positions need to be masked. Default None.
memory_mask (Tensor|None, optional) – A tensor used in decoder-encoder cross attention to prevent attention to unwanted positions, usually the paddings. Its shape must be broadcastable to [batch_size, n_head, target_length, source_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when no positions need to be masked. Default None.
cache (list|tuple, optional) – A list in which each element is a tuple (incremental_cache, static_cache). See TransformerDecoder.gen_cache for more details. It is only used for inference and should be None for training. Default None.
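The three mask conventions described for tgt_mask and memory_mask encode the same information. The following is a minimal sketch in plain Python that builds a causal (subsequent-position) mask in each convention. The helper name causal_masks is hypothetical and not part of Paddle; in practice the nested lists would be converted to tensors (for example with paddle.to_tensor) before being passed as tgt_mask.

```python
import math


def causal_masks(tgt_len):
    """Return one causal mask in the bool, int, and float conventions.

    Hypothetical helper for illustration only; it is not a Paddle API.
    """
    # Position i may attend to position j only when j <= i.
    bool_mask = [[j <= i for j in range(tgt_len)] for i in range(tgt_len)]
    # int convention: 1 = attend, 0 = unwanted position.
    int_mask = [[int(keep) for keep in row] for row in bool_mask]
    # float convention: 0.0 = attend, -inf = unwanted position.
    float_mask = [
        [0.0 if keep else -math.inf for keep in row] for row in bool_mask
    ]
    return bool_mask, int_mask, float_mask


bool_mask, int_mask, float_mask = causal_masks(3)
for row in int_mask:
    print(row)
```

A mask of shape [target_length, target_length] like this one relies on broadcasting over the batch_size and n_head dimensions.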

Returns

A tensor with the same shape and data type as tgt, representing the output of the Transformer decoder. If cache is not None, a tuple is returned instead: it contains the decoder output and the new cache, which has the same structure as the input cache argument except that the incremental_cache in it has grown in length. See MultiHeadAttention.gen_cache and MultiHeadAttention.forward for more details.

Return type

Tensor|tuple

gen_cache(memory: Tensor, do_zip: Literal[False] = False) → list[tuple[paddle.nn.layer.transformer.Cache, paddle.nn.layer.transformer.StaticCache]] | list[tuple[paddle.nn.layer.transformer.Cache, ...] | tuple[paddle.nn.layer.transformer.StaticCache, ...]]

gen_cache(memory: Tensor, do_zip: Literal[True] = False) → list[tuple[paddle.nn.layer.transformer.Cache, ...] | tuple[paddle.nn.layer.transformer.StaticCache, ...]]

gen_cache(memory: Tensor, do_zip: bool = False) → list[tuple[paddle.nn.layer.transformer.Cache, paddle.nn.layer.transformer.StaticCache]] | list[tuple[paddle.nn.layer.transformer.Cache, ...] | tuple[paddle.nn.layer.transformer.StaticCache, ...]]

Generates the cache used by forward. The generated cache is a list, and each element in it is a tuple (incremental_cache, static_cache) produced by TransformerDecoderLayer.gen_cache. See TransformerDecoderLayer.gen_cache for more details. If do_zip is True, zip is applied to these tuples to produce a list with two elements.

Parameters

memory (Tensor) – The output of Transformer encoder. It is a tensor with shape [batch_size, source_length, d_model]. The data type should be float32 or float64.
do_zip (bool, optional) – Indicates whether to apply zip to the tuples. If True, a list with two elements is returned. Default False.

Returns

A list in which each element is a tuple produced by TransformerDecoderLayer.gen_cache(memory). See TransformerDecoderLayer.gen_cache for more details. If do_zip is True, zip is applied to these tuples and a list with two elements is returned.

Return type

list
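The effect of do_zip can be illustrated without Paddle. The sketch below uses hypothetical namedtuple stand-ins for the per-layer cache objects (they are not the real paddle.nn.layer.transformer classes) and shows how zip regroups a per-layer list into a two-element list.

```python
from collections import namedtuple

# Hypothetical stand-ins for the real cache types, for illustration only.
Cache = namedtuple("Cache", ["k", "v"])
StaticCache = namedtuple("StaticCache", ["k", "v"])

num_layers = 3

# do_zip=False layout: one (incremental_cache, static_cache) tuple per layer.
per_layer = [
    (Cache(f"k{i}", f"v{i}"), StaticCache(f"sk{i}", f"sv{i}"))
    for i in range(num_layers)
]

# do_zip=True layout: zip regroups the tuples into two elements, the first
# holding every layer's incremental cache and the second every static cache.
zipped = list(zip(*per_layer))

print(len(per_layer), len(zipped), len(zipped[0]))
```

The zipped layout groups caches by kind rather than by layer, which can be convenient when all incremental caches need to be handled together.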