weight_only_linear
paddle.nn.quant.weight_only_linear(x, weight, bias=None, weight_scale=None, weight_dtype='int8', arch=None, group_size=-1)
Applies a matrix multiplication of x with the quantized weight, then adds bias if it is provided. This method requires CUDA version >= 11.2.
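Conceptually, this is equivalent to dequantizing weight with weight_scale and running an ordinary matmul against the transposed weight. The sketch below is a plain-Paddle reference of those semantics, not the fused CUDA kernel; the [out_features, in_features] weight layout and the per-output-channel scale interpretation are assumptions read off the parameter descriptions and the example below.

>>> import paddle
>>> # int8 weight of shape [out_features, in_features], one float scale per output channel
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(-127, 127, [32, 64]), dtype='int8')
>>> weight_scale = paddle.rand([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> # Dequantize per output channel, then compute x @ W^T + bias in float32.
>>> dequant_w = paddle.cast(weight, 'float32') * weight_scale.unsqueeze(-1)
>>> out = paddle.matmul(paddle.cast(x, 'float32'), dequant_w.T) + paddle.cast(bias, 'float32')
>>> print(out.shape)
[1, 2, 32]

The fused kernel performs this dequantization on the fly inside the GEMM, so the low-precision weight is never materialized in float16.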
Parameters
x (Tensor) – The first input Tensor to be multiplied. The data type must be float16 or bfloat16.
weight (Tensor) – The second input Tensor to be multiplied. Its rank must be 2.
bias (Tensor|None) – The input bias Tensor. If it is None, no bias addition is performed. Otherwise, the bias is added to the matrix multiplication result.
weight_scale (Tensor|None) – The input scale Tensor provided for dequantizing weight. Its rank must be 1.
weight_dtype (str) – The dtype of weight, must be one of 'int8' or 'int4'. Default: 'int8'.
arch (int|None) – The compute architecture of the target device, e.g. 80 for A100 and 70 for V100. If it is None, the architecture is detected from the current device. Default: None.
group_size (int) – The group size for weight quantization. -1 means the default per-channel mode; otherwise only 64 and 128 are supported. Default: -1.
Returns
The output Tensor, whose data type is the same as that of x.
Return type
Tensor
Examples
>>> import paddle
>>> from paddle.nn.quant import weight_only_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(0, 127, [32, 64]), dtype='int8')
>>> scale = paddle.randn([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     out = weight_only_linear(x, weight, bias=bias, weight_scale=scale, weight_dtype='int8')
...     print(out.shape)
[1, 2, 32]
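In practice, the quantized weight and its scale usually come from the companion paddle.nn.quant.weight_quantize helper rather than from random data. A minimal end-to-end sketch follows; the [in_features, out_features] layout fed to weight_quantize and the 'weight_only_int8' algo string are assumptions, so consult weight_quantize's documentation for the exact convention.

>>> import paddle
>>> from paddle.nn.quant import weight_quantize, weight_only_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> # Float weight, assumed here to be in [in_features, out_features] layout.
>>> w = paddle.cast(paddle.randn([64, 32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     # Assumed to return the quantized int8 weight and its per-channel scale.
...     qw, scale = weight_quantize(w, algo='weight_only_int8')
...     out = weight_only_linear(x, qw, weight_scale=scale, weight_dtype='int8')
...     print(out.shape)
[1, 2, 32]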