llm_int8_linear
- paddle.nn.quant.llm_int8_linear(x, weight, bias=None, weight_scale=None, threshold=6.0) [source]
-
Applies an int8 matrix multiplication of x and weight (the weight being dequantized with weight_scale), then adds bias if it is provided. This method requires CUDA version >= 11.2.
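In effect, weight is stored in int8 and dequantized with weight_scale before the product. A minimal reference sketch of the non-outlier path, assuming one scale per output channel and a plain weight * scale dequantization convention (the exact convention inside the kernel is not spelled out here):

>>> import paddle
>>> def llm_int8_linear_ref(x, weight, bias=None, weight_scale=None):
...     # Dequantize: one float scale per output channel (assumed convention).
...     w = paddle.cast(weight, x.dtype) * paddle.cast(weight_scale, x.dtype).unsqueeze(1)
...     # weight is [n, k]; multiply against its transpose: [..., k] -> [..., n].
...     out = paddle.matmul(x, w, transpose_y=True)
...     return out + bias if bias is not None else out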
- Parameters
-
x (Tensor) – the first input Tensor to be multiplied; its data type must be float16 or bfloat16.
weight (Tensor) – the second input Tensor to be multiplied; its data type must be int8 and its rank must be 2.
bias (Tensor|None) – the input bias Tensor. If it is None, no bias addition is performed; otherwise, the bias is added to the matrix multiplication result.
weight_scale (Tensor|None) – the scale Tensor used to dequantize weight; its rank must be 1.
threshold (float) – the magnitude threshold that marks an activation channel of x as an outlier. Outlier channels bypass int8 quantization and are multiplied in x's original dtype, following the LLM.int8() mixed-precision scheme (see the sketch before the Examples below).
- Returns
-
the output Tensor, the data type is the same as that of x.
- Return type
-
Tensor
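To make the threshold concrete: under the LLM.int8() scheme, input channels whose absolute activation value exceeds the threshold are treated as outliers and multiplied in x's dtype rather than in int8. A hedged illustration of that channel split (the mask logic is illustrative only, not part of this API):

>>> import paddle
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> # Per-input-channel absolute maximum across all tokens.
>>> col_max = paddle.max(paddle.abs(x.reshape([-1, 64])), axis=0)
>>> outlier_mask = col_max > 6.0  # these channels would bypass int8 quantization
>>> print(outlier_mask.shape)
[64]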
Examples
>>> import paddle
>>> from paddle.nn.quant import llm_int8_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(0, 127, [32, 64]), dtype='int8')
>>> scale = paddle.randn([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     out = llm_int8_linear(x, weight, bias=bias, weight_scale=scale, threshold=6.0)
...     print(out.shape)
[1, 2, 32]