This article discusses only the 8-bit quantization specification of TensorFlow Lite as used in CMSIS-NN. For other quantization methods, see the following series of articles:

## Introduction

The 8-bit quantization specified by TensorFlow Lite uses the following formula to convert between fixed-point and floating-point values:

\(real\_value = (int8\_value - zero\_point) \times scale\)

The fixed-point weight values, quantized per-channel or per-layer, are represented in int8 two's complement with the zero point fixed at 0 and a range of [-127, 127]. Activations/inputs are quantized per-layer in int8 two's complement with a range of [-128, 127], and their zero point may be anywhere in [-128, 127].

## Key Points of the Specification

TensorFlow Lite specifies the following quantization constraints to balance the accuracy and performance of neural networks on resource-constrained devices:

Priority is given to int8 quantization (although earlier versions supported asymmetric uint8 quantization for some specific operations). In practice there is no significant difference between asymmetric int8 and asymmetric uint8; however, asymmetric int8 can conveniently be replaced by symmetric int8 quantization in some cases.

Per-channel quantization usually refers to quantization performed along the output-channel dimension of convolutional weights.

Activation values use asymmetric int8 quantization, weights use symmetric int8 quantization, and biases use symmetric int32 quantization with scale \(scale_{input} \times scale_{weight}\). Asymmetric quantization of activations gains extra precision in a relatively inexpensive way, while symmetric quantization of weights mainly avoids the extra overhead of multiplying weight zero points with activation values.

In the specification, the weights and biases of conv_2d and depthwise_conv_2d are configured with per-channel quantization, the weights and biases of fc are configured with per-layer quantization, and the remaining inputs and outputs all use per-layer quantization.

The specification requires that {avg_pool_2d, concat, maxpool_2d, reshape, resize_bilinear, space_to_depth, pad, gather, batch_to_space_nd, space_to_batch_nd, transpose, squeeze, max, min, slice} use the same zero point (zp) and scale for input and output.

The specification requires that the output of l2_norm and tanh have scale 1/128 and zp 0.

The specification requires that the output of sigmoid and softmax have scale 1/256 and zp -128.

The specification requires that the output of log_softmax have scale 16/256 and zp 127.
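The fixed sigmoid parameters are easy to check: with scale 1/256 and zero point -128, the int8 range [-128, 127] covers real values [0, 255/256], which matches sigmoid's (0, 1) output range almost exactly. A small sketch (illustrative function name):

```c
#include <stdint.h>

/* Dequantize a sigmoid output quantized with the fixed parameters
 * from the specification: scale = 1/256, zero point = -128. */
static float dequant_sigmoid_output(int8_t q) {
    const int32_t zero_point = -128;
    const float scale = 1.0f / 256.0f;
    return (float)((int32_t)q - zero_point) * scale;
}
```

Here q = -128 maps to 0.0 and q = 127 maps to 255/256 = 0.99609375, so no per-model calibration of the output tensor is needed.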

## Changes on the CMSIS-NN Platform

On the CMSIS-NN platform, the kernels use `q7_t` to name the int8 type and `q15_t` to name the int16 type. Quantization uses power-of-two scales, so a real value is represented as
\(Q \times 2^{-fl}\)
, where \(fl\) is the number of fractional bits; the scaling operation can then be performed by shifting. The 8-bit quantization scheme of the TensorFlow Lite specification is not used because some Arm Cortex-M CPUs lack a dedicated floating-point unit (FPU), and power-of-two scaling avoids the floating-point dequantization that would otherwise be required between layers. Another advantage is that a simpler activation-function lookup table can be used.

Reference: