llm_int8_linear
- paddle.nn.quant.llm_int8_linear(x, weight, bias=None, weight_scale=None, threshold=6.0) [source]
-
Applies an int8 matrix multiplication of x and weight (the weight being dequantized with weight_scale), then adds bias if it is provided. This method requires CUDA version >= 11.2.
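In effect, weight is stored in int8 and dequantized with weight_scale before the product. A minimal reference sketch of the non-outlier path, assuming one scale per output channel and a plain weight * scale dequantization convention (the exact convention inside the kernel is not spelled out here):

>>> import paddle
>>> def llm_int8_linear_ref(x, weight, bias=None, weight_scale=None):
...     # Dequantize: one float scale per output channel (assumed convention).
...     w = paddle.cast(weight, x.dtype) * paddle.cast(weight_scale, x.dtype).unsqueeze(1)
...     # weight is [n, k]; multiply against its transpose: [..., k] -> [..., n].
...     out = paddle.matmul(x, w, transpose_y=True)
...     return out + bias if bias is not None else out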
- Parameters
-
x (Tensor) – the first input Tensor to be multiplied; its data type must be float16 or bfloat16.
weight (Tensor) – the second input Tensor to be multiplied; its data type must be int8 and its rank must be 2.
bias (Tensor|None) – the input bias Tensor. If it is None, no bias addition is performed; otherwise, the bias is added to the matrix multiplication result.
weight_scale (Tensor|None) – the scale Tensor used to dequantize weight; its rank must be 1.
threshold (float) – the magnitude threshold that marks an activation channel of x as an outlier. Outlier channels bypass int8 quantization and are multiplied in x's original dtype, following the LLM.int8() mixed-precision scheme (see the sketch before the Examples below).
- Returns
-
the output Tensor, the data type is the same as that of x.
- Return type
-
Tensor
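To make the threshold concrete: under the LLM.int8() scheme, input channels whose absolute activation value exceeds the threshold are treated as outliers and multiplied in x's dtype rather than in int8. A hedged illustration of that channel split (the mask logic is illustrative only, not part of this API):

>>> import paddle
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> # Per-input-channel absolute maximum across all tokens.
>>> col_max = paddle.max(paddle.abs(x.reshape([-1, 64])), axis=0)
>>> outlier_mask = col_max > 6.0  # these channels would bypass int8 quantization
>>> print(outlier_mask.shape)
[64]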
Examples
>>> import paddle
>>> from paddle.nn.quant import llm_int8_linear
>>> x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
>>> weight = paddle.cast(paddle.randint(0, 127, [32, 64]), dtype='int8')
>>> scale = paddle.randn([32], dtype='float32')
>>> bias = paddle.cast(paddle.randn([32]), dtype='float16')
>>> if paddle.device.cuda.get_device_capability()[0] >= 8:
...     out = llm_int8_linear(x, weight, bias=bias, weight_scale=scale, threshold=6.0)
...     print(out.shape)
[1, 2, 32]