moe_ffn¶
- paddle.incubate.nn.functional.moe_ffn(permute_input: Tensor, token_nums_per_expert: Tensor, ffn1_weight: Tensor, ffn2_weight: Tensor, ffn1_bias: Tensor | None = None, ffn1_scale: Tensor | None = None, ffn2_scale: Tensor | None = None, quant_method: str = 'None') → Tensor [source]
Applies the feed-forward network (FFN) to the dispatched tokens for each expert.
This function runs the per-expert FFN computation over the tokens assigned to each expert and supports optional quantization of the expert weights.
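Conceptually, the fused kernel is equivalent to looping over experts and applying each expert's two linear layers to its contiguous slice of permute_input. A minimal unfused sketch follows; the moe_ffn_reference name is hypothetical, and the SwiGLU-style split of the doubled ffn1 output is an assumption (the fused kernel's actual activation may differ):

import paddle
import paddle.nn.functional as F

def moe_ffn_reference(permute_input, token_nums_per_expert,
                      ffn1_weight, ffn2_weight, ffn1_bias=None):
    # Unfused sketch: process each expert's contiguous token slice in turn.
    outputs = []
    start = 0
    for e, n in enumerate(token_nums_per_expert.tolist()):
        if n == 0:
            continue  # this expert received no tokens
        x = permute_input[start:start + n]                # [n, d_model]
        h = paddle.matmul(x, ffn1_weight[e])              # [n, d_ffn * 2]
        if ffn1_bias is not None:
            h = h + ffn1_bias[e]                          # bias broadcasts over tokens
        # Assumption: the doubled ffn1 output holds gate/value halves that are
        # combined with a SwiGLU-style activation; the fused kernel may differ.
        gate, up = paddle.chunk(h, 2, axis=-1)
        h = F.silu(gate) * up                             # [n, d_ffn]
        outputs.append(paddle.matmul(h, ffn2_weight[e]))  # [n, d_model]
        start += n
    return paddle.concat(outputs, axis=0)                 # [total_tokens, d_model]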
- Parameters
permute_input (Tensor) – The input tensor after dispatching, with shape [total_tokens, d_model].
token_nums_per_expert (Tensor) – The number of tokens assigned to each expert.
ffn1_weight (Tensor) – The weight for the first linear layer, with shape [num_experts, d_model, d_ffn * 2].
ffn2_weight (Tensor) – The weight for the second linear layer, with shape [num_experts, d_ffn, d_model].
ffn1_bias (Tensor | None) – Bias for the first linear layer, with shape [num_experts, 1, d_ffn * 2]. If None, bias is not used.
ffn1_scale (Tensor | None) – Scale tensor for dequantization of ffn1_weight, with shape [num_experts, d_ffn * 2]. If None, scale is not applied.
ffn2_scale (Tensor | None) – Scale tensor for dequantization of ffn2_weight, with shape [num_experts, d_model]. If None, scale is not applied.
quant_method (str) – Quantization method to apply to the weights. Quantized execution is currently not supported, so only 'None' is accepted. Default is 'None'.
- Returns
The output tensor after FFN computation, with shape [total_tokens, d_model].
- Return type
Tensor
Examples
>>> import paddle
>>> from paddle.incubate.nn.functional import moe_ffn
>>> permute_input = paddle.randn([7680, 768])
>>> token_nums_per_expert = paddle.to_tensor([48], dtype='int64')
>>> ffn1_weight = paddle.randn([48, 768, 6144])
>>> ffn2_weight = paddle.randn([48, 3072, 768])
>>> out = moe_ffn(permute_input, token_nums_per_expert, ffn1_weight, ffn2_weight, None, None)
>>> print(out.shape)
[7680, 768]
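The entries of token_nums_per_expert describe how many consecutive rows of permute_input belong to each expert. A hypothetical variant with four experts (the sizes are illustrative and assume the per-expert counts sum to the total token count):

>>> import paddle
>>> from paddle.incubate.nn.functional import moe_ffn
>>> token_nums_per_expert = paddle.to_tensor([2048, 1024, 512, 512], dtype='int64')
>>> permute_input = paddle.randn([4096, 768])
>>> ffn1_weight = paddle.randn([4, 768, 6144])
>>> ffn2_weight = paddle.randn([4, 3072, 768])
>>> out = moe_ffn(permute_input, token_nums_per_expert, ffn1_weight, ffn2_weight)
>>> print(out.shape)
[4096, 768]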