weight_only_linear
paddle.nn.quant.weight_only_linear(x, weight, bias=None, weight_scale=None, weight_dtype='int8', arch=None, group_size=-1)
Applies a matrix multiplication of x with the quantized weight, then adds bias if it is provided. This method requires CUDA version >= 11.2.
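Conceptually, this is equivalent to dequantizing weight with weight_scale and running an ordinary matmul against the transposed weight. The sketch below is a plain-Paddle reference of those semantics, not the fused CUDA kernel; the [out_features, in_features] weight layout and the per-output-channel scale interpretation are assumptions read off the parameter descriptions and the example below.

>>> import paddle
>>> # int8 weight of shape [out_features, in_features], one float scale per output channel
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(-127, 127, [32, 64]), dtype='int8')
>>> weight_scale = paddle.rand([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> # Dequantize per output channel, then compute x @ W^T + bias in float32.
>>> dequant_w = paddle.cast(weight, 'float32') * weight_scale.unsqueeze(-1)
>>> out = paddle.matmul(paddle.cast(x, 'float32'), dequant_w.T) + paddle.cast(bias, 'float32')
>>> print(out.shape)
[1, 2, 32]

The fused kernel performs this dequantization on the fly inside the GEMM, so the low-precision weight is never materialized in float16.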
Parameters
x (Tensor) – The first input Tensor to be multiplied. The data type must be float16 or bfloat16.
weight (Tensor) – The second input Tensor to be multiplied. Its rank must be 2.
bias (Tensor|None) – The input bias Tensor. If it is None, no bias addition is performed. Otherwise, the bias is added to the matrix multiplication result.
weight_scale (Tensor|None) – The input scale Tensor provided for dequantizing weight. Its rank must be 1.
weight_dtype (str) – The dtype of weight, must be one of 'int8' or 'int4'. Default: 'int8'.
arch (int|None) – The compute architecture of the target device, e.g. 80 for A100 and 70 for V100. If it is None, the architecture is detected from the current device. Default: None.
group_size (int) – The group size for weight quantization. -1 means the default per-channel mode; otherwise only 64 and 128 are supported. Default: -1.
Returns
The output Tensor, whose data type is the same as that of x.
Return type
Tensor
Examples
>>> import paddle
>>> from paddle.nn.quant import weight_only_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(0, 127, [32, 64]), dtype='int8')
>>> scale = paddle.randn([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     out = weight_only_linear(x, weight, bias=bias, weight_scale=scale, weight_dtype='int8')
...     print(out.shape)
[1, 2, 32]
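In practice, the quantized weight and its scale usually come from the companion paddle.nn.quant.weight_quantize helper rather than from random data. A minimal end-to-end sketch follows; the [in_features, out_features] layout fed to weight_quantize and the 'weight_only_int8' algo string are assumptions, so consult weight_quantize's documentation for the exact convention.

>>> import paddle
>>> from paddle.nn.quant import weight_quantize, weight_only_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> # Float weight, assumed here to be in [in_features, out_features] layout.
>>> w = paddle.cast(paddle.randn([64, 32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     # Assumed to return the quantized int8 weight and its per-channel scale.
...     qw, scale = weight_quantize(w, algo='weight_only_int8')
...     out = weight_only_linear(x, qw, weight_scale=scale, weight_dtype='int8')
...     print(out.shape)
[1, 2, 32]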