Gemmlowp Quantization
Table of Contents
- 1. Gemmlowp Quantization
- 1.1. Overview
- 1.2. Quantization as an affine map.
- 1.3. Domain-specific constraint: the real value 0 must be exactly representable.
- 1.4. The final form of the quantization equation
- 1.5. Quantizing a matrix multiplication
- 1.6. Implementation of quantized matrix multiplication
- 1.7. How this is implemented in gemmlowp
- 1.8. How this differs from the older legacy gemmlowp quantization paradigm
- 1.9. Example code illustrating the new quantization paradigm
1. Gemmlowp Quantization
gemmlowp is the GEMM (General Matrix Multiply) library used by tflite's underlying optimized kernels. It supports computations on quantized matrices, but requires the caller (tflite) to supply the corresponding quantization parameters, such as scale and zero_point.
tensorflow/tensorflow/lite/micro/tools/make/downloads/gemmlowp/eight_bit_int_gemm/eight_bit_int_gemm.cc::const int lhs_offset = a_offset;
https://github.com/google/gemmlowp/blob/master/doc/quantization.md
TLDR: If you prefer example code over theory, look at doc/quantization_example.cc.
1.1. Overview
gemmlowp allows performing calculations on matrices of uint8 values, but these matrices are only useful insofar as they somehow approximate matrices of real numbers. By a quantization paradigm we mean a correspondence between matrices of quantized 8bit values and matrices of real numbers. The choice of a quantization paradigm affects the calculations that gemmlowp itself needs to perform, specifically, it affects how one goes from the internal 32bit accumulators to the final 8bit outputs.
The part of gemmlowp transforming the internal 32bit accumulators into final 8bit outputs is the "output pipeline" described in output.md.
gemmlowp's GemmWithOutputPipeline entry point allows specifying an arbitrary output pipeline, allowing the user to implement their own preferred quantized arithmetic paradigm.
In the present document, our purpose is to show how, reasoning from first principles and some domain-specific knowledge of neural networks, we can arrive naturally at some specific quantization paradigm, and how that can be implemented using a specific output pipeline.
We also aim to show how that differs from the older, legacy quantization paradigm implemented by gemmlowp's legacy interfaces and why the change to the newer quantization paradigm described in this document was useful as far as some applications of gemmlowp were concerned.
1.2. Quantization as an affine map.
In order for arithmetic on real values to map directly to arithmetic on quantized uint8 values, the mapping between real and quantized uint8 values must be affine, which means it must be of the form
real_value = A * quantized_value + B (1)
for some constants A, B, or equivalently, of the form
real_value = C * (quantized_value + D) (2)
for some constants C, D. Indeed, anything other than such an affine map would mean that the results of the quantized calculations no longer readily provide an approximation to the results of the real-numbers calculation.
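As a quick numeric illustration (the constants A, B, C, D below are chosen arbitrarily, not taken from any real network), forms (1) and (2) describe the same affine map whenever C = A and D = B / A:

#include <cstdio>

int main() {
  // Form (1): real_value = A * quantized_value + B
  const float A = 0.05f, B = -6.4f;
  // Form (2): real_value = C * (quantized_value + D), with C = A and D = B / A
  const float C = A, D = B / A;
  const int quantized_values[] = {0, 128, 255};
  for (int q : quantized_values) {
    std::printf("q=%3d  form(1)=%8.4f  form(2)=%8.4f\n", q, A * q + B, C * (q + D));
  }
  return 0;
}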
1.3. Domain-specific constraint: the real value 0 must be exactly representable.
Here a domain-specific constraint from neural networks appears: for some neural network layers, it is very useful for optimized implementations that the real value 0 be exactly representable.
For instance, in a Convolutional or Pooling layer with padding, it is useful to be able to implement the padding by zero-padding the input array, so that optimized loops do not need to become more complex to avoid overrunning the array bounds.
In order for such zero-padding to be feasible in a quantized implementation of such layers, it is important that the real value '0' be exactly representable in quantized form, i.e. that it correspond exactly to some quantized value, which we call the zero-point.
Indeed, if '0' were not exactly representable, then we would have to use for padding some quantized value that does not exactly correspond to the real value '0'. That would typically introduce inaccuracy in the result. Worse, since the same padding value would be used everywhere, it would introduce a bias in the result.
1.4. The final form of the quantization equation
Now let us phrase what this constraint (that the real value 0 be exactly representable) means in each of the quantization equations, (1) and (2).
In equation (1), plugging real_value = 0 and quantized_value = zero_point, we get:

0 = A * zero_point + B

equivalently:

zero_point = -B / A

We are thus left with a rather awkward constraint: the real number -B / A must somehow be guaranteed to be exactly integral, so that the special uint8 value zero_point can be exactly equal to it. Quite awkward!
Now let us look at equation (2). Plugging real_value = 0 and quantized_value = zero_point, we get:

0 = C * (zero_point + D)

Conveniently, the constant C plays no role anymore, so this equation simplifies to:

0 = zero_point + D

In other words, D = -zero_point. This suggests rewriting the quantization equation (2) into the following form (3), which will be the final form that we will consistently use:

real_value = scale * (quantized_value - zero_point) (3)

To go from (2) to (3), we merely renamed C to scale and D to -zero_point.
With this quantization equation (3), the condition that 0 be exactly representable is vacuously satisfied: zero_point is by definition one of the possible quantized_value's, and equation (3) maps it to a real_value of exactly 0.
Note that the final quantization equation (3) depends on two constants, one integral, the other an arbitrary positive real number:
- zero_point is integral, more specifically it is one of the possible quantized values (i.e. typically a uint8 value).
- scale is a positive real number.
Thus at this stage we have not yet shown how to eliminate all usage of floating-point arithmetic. That will come below.
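For illustration, here is a minimal C++ sketch of equation (3). The helper names (QuantizationParams, Quantize, Dequantize) and the rounding/clamping choices are ours, not part of gemmlowp's API:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Equation (3): real_value = scale * (quantized_value - zero_point).
struct QuantizationParams {
  float scale;
  std::uint8_t zero_point;
};

float Dequantize(const QuantizationParams& qp, std::uint8_t q) {
  return qp.scale * (static_cast<int>(q) - static_cast<int>(qp.zero_point));
}

std::uint8_t Quantize(const QuantizationParams& qp, float real_value) {
  // Invert equation (3), round to nearest, and saturate to the uint8 range.
  const float transformed = qp.zero_point + real_value / qp.scale;
  const float clamped = std::min(255.f, std::max(0.f, transformed));
  return static_cast<std::uint8_t>(std::round(clamped));
}

Note how the real value 0 maps exactly back to zero_point, as the constraint above requires, and how real values outside the representable range must saturate to [0, 255].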
1.5. Quantizing a matrix multiplication
Now that we know, from equation (3), how real numbers correspond to quantized values (typically uint8), we turn to applying this knowledge to rewriting a multiplication of matrices of real numbers as the equivalent multiplication of matrices of quantized values.
Say that we have two matrices of real values lhs_real_matrix, rhs_real_matrix. Each entry of their product is the sum (accumulation) of many products of individual matrix entries, say lhs_real_value * rhs_real_value.
Now suppose that we have already quantized these two matrices according to the above equation (3), with some already-known quantization parameters lhs_scale, rhs_scale, lhs_zero_point, rhs_zero_point, so that their matrix entries are quantized as

lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
We then rewrite the matrix product accumulator accordingly:
result_real_value
  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
  = Sum_over_i(
        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
    )
  = lhs_scale * rhs_scale * Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )                                                         (4)
Now our goal is to represent this result itself as a quantized matrix, i.e. still according to equation (3), for some pre-established quantization parameters result_scale and result_zero_point, as

result_real_value = result_scale * (result_quantized_value - result_zero_point)
Here we need to keep in mind that our goal is to specify what the quantized matrix multiplication should do, i.e. how to compute result_quantized_value. The last equation above is equivalent to

result_quantized_value = result_zero_point + result_real_value / result_scale
Now we can use equation (4) above to plug into this the expression of result_real_value in terms of the quantized operands, and we obtain:
result_quantized_value
  = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) *
    Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )                                                         (5)
Equation (5) is the conclusion of this general discussion of how to specify what "quantized matrix multiplication" should actually compute, in order to be able to replace real matrix multiplications.
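As a reference for what equation (5) asks the quantized computation to produce, here is a deliberately naive C++ sketch for a single output entry. All names are illustrative, and the floating-point multiplication in it is exactly what the next section shows how to avoid:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive evaluation of equation (5) for one output entry: one row of lhs
// against one column of rhs, both given as flat vectors of equal length.
std::uint8_t QuantizedDotProductReference(
    const std::vector<std::uint8_t>& lhs_quantized,
    const std::vector<std::uint8_t>& rhs_quantized,
    int lhs_zero_point, int rhs_zero_point,
    float lhs_scale, float rhs_scale,
    int result_zero_point, float result_scale) {
  std::int32_t int32_accumulator = 0;
  for (std::size_t i = 0; i < lhs_quantized.size(); ++i) {
    int32_accumulator +=
        (static_cast<std::int32_t>(lhs_quantized[i]) - lhs_zero_point) *
        (static_cast<std::int32_t>(rhs_quantized[i]) - rhs_zero_point);
  }
  // The real multiplier of equation (5); a positive real constant.
  const float real_multiplier = lhs_scale * rhs_scale / result_scale;
  const float result = result_zero_point + real_multiplier * int32_accumulator;
  // Round and saturate to the uint8 output range.
  return static_cast<std::uint8_t>(std::min(255.f, std::max(0.f, std::round(result))));
}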
1.6. Implementation of quantized matrix multiplication
Having obtained the mathematical form (5) of quantized matrix multiplication, we now turn to its actual implementation.
The inner-most part of (5),

int32_accumulator = Sum_over_i(
    (lhs_quantized_value[i] - lhs_zero_point) *
    (rhs_quantized_value[i] - rhs_zero_point)
)

is the "kernel" accumulation loop. It is where the bulk of the computational cost goes. Luckily, it only involves integers: the quantized operands matrix entries, and their zero_point quantization parameters. Typically, all of these values are uint8. Typically, the above differences of uint8 values would be represented as signed int16; their products as signed int32.
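A minimal sketch of that widening in plain C++ (gemmlowp's actual kernels are heavily vectorized; the types below merely illustrate the int16/int32 representation just described):

#include <cstdint>

// Accumulate one dot product with explicit widening:
// uint8 -> int16 differences, int16 x int16 -> int32 products.
std::int32_t AccumulateWithZeroPoints(const std::uint8_t* lhs, const std::uint8_t* rhs,
                                      int depth, std::int16_t lhs_zero_point,
                                      std::int16_t rhs_zero_point) {
  std::int32_t int32_accumulator = 0;
  for (int i = 0; i < depth; ++i) {
    const std::int16_t lhs_diff = static_cast<std::int16_t>(lhs[i]) - lhs_zero_point;
    const std::int16_t rhs_diff = static_cast<std::int16_t>(rhs[i]) - rhs_zero_point;
    int32_accumulator += static_cast<std::int32_t>(lhs_diff) * rhs_diff;
  }
  return int32_accumulator;
}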
It is out of scope of the present doc to discuss how to avoid the overhead of having to subtract these zero_point constants in this inner loop; refer to this section of low-precision.md for that. The gist of it is that a mathematical trick allows us to take the handling of these zero_point constants out of this accumulation loop, so that it simplifies to
int32_accumulator = Sum_over_i( lhs_quantized_value[i] * rhs_quantized_value[i] ) (6)
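The algebra behind the trick is only sketched here (low-precision.md is the authoritative reference, and gemmlowp performs it during packing rather than as written below): expanding the product shows that the zero_point terms reduce to row/column sums that can be handled outside the inner loop. An illustrative C++ version for a single dot product:

#include <cstdint>

// Expansion of Sum_i (l[i] - lz) * (r[i] - rz) into
//   Sum_i l[i]*r[i]  -  lz * Sum_i r[i]  -  rz * Sum_i l[i]  +  depth * lz * rz
// The first sum is the zero-point-free accumulation of equation (6); the
// remaining terms only need the operand sums, computed outside the hot loop.
std::int32_t AccumulateWithTrick(const std::uint8_t* lhs, const std::uint8_t* rhs,
                                 int depth, std::int32_t lhs_zero_point,
                                 std::int32_t rhs_zero_point) {
  std::int32_t dot = 0, lhs_sum = 0, rhs_sum = 0;
  for (int i = 0; i < depth; ++i) {
    dot += static_cast<std::int32_t>(lhs[i]) * rhs[i];  // equation (6)
    lhs_sum += lhs[i];
    rhs_sum += rhs[i];
  }
  return dot - lhs_zero_point * rhs_sum - rhs_zero_point * lhs_sum +
         depth * lhs_zero_point * rhs_zero_point;
}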
Anyway, the result is an int32_accumulator that we now plug back into the rest of (5):

result_quantized_value = result_zero_point + (lhs_scale * rhs_scale / result_scale) * int32_accumulator (7)
The difficulty here is of course that (lhs_scale * rhs_scale / result_scale) is a positive real number, not an integer in general. It is a constant, though. So what we have to implement here is the (approximate) scaling of an int32 value by some arbitrary positive constant multiplier.
Moreover, it is safe to assume that this positive constant multiplier is smaller than one: each of the scale values here is typically smaller than one, as we are typically mapping the [0..255] quantized uint8 value range to an interval of real values that is much narrower than that, typically within [-10,10] in most neural networks. For example, a neural network using Relu6 activation functions will typically have real activation values in the interval [0,6].
So how do we implement the multiplication of an int32 value by a positive real constant that is smaller than one? Typically, by multiplying by a fixed-point constant multiplier in the normalized interval [1/2, 1), and then right-shifting the result by the appropriate number of bits to account for the remaining power-of-two factor.
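A sketch of how such a fixed-point multiplier and shift amount can be derived from the real multiplier, modeled on the helper in doc/quantization_example.cc (the exact helper a given application uses may differ):

#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decompose a real multiplier in (0, 1) as
//   real_multiplier ~= (quantized_multiplier / 2^31) * 2^(-right_shift)
// where quantized_multiplier is an int32 representing a fixed-point value
// in [1/2, 1), i.e. in [2^30, 2^31).
void QuantizeMultiplierSmallerThanOne(float real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0.f && real_multiplier < 1.f);
  int s = 0;
  // Bring real_multiplier into [1/2, 1) by doubling it, counting the doublings
  // so they can be compensated by a right shift of the same amount.
  while (real_multiplier < 0.5f) {
    real_multiplier *= 2.0f;
    ++s;
  }
  // Represent the normalized multiplier as a Q0.31 fixed-point number.
  std::int64_t q = static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
  assert(q <= (1ll << 31));
  // Corner case: rounding pushed the value up to exactly 2^31.
  if (q == (1ll << 31)) {
    q /= 2;
    --s;
  }
  assert(q <= std::numeric_limits<std::int32_t>::max());
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = s;
}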
At this point we have obtained the int32 value of the product

(lhs_scale * rhs_scale / result_scale) * int32_accumulator

Looking at (7), it only remains to add to it the integral value result_zero_point, and we are done.
1.7. How this is implemented in gemmlowp
The different parts of gemmlowp implementing aspects of the above discussion are:
- The packing stage (see packing.md) implements the special mathematical trick to handle lhs_offset, rhs_offset that we alluded to above; see this section of low-precision.md for details. Thanks to it, the rest of the calculation can proceed as if lhs_offset, rhs_offset were 0.
- The compute/kernel stage (see kernel.md) performs the core accumulation loop producing the int32_accumulator, see equation (6) above.
- The unpacking stage feeds into the output pipeline (see output.md), which implements the rest of the evaluation of the above equation (5), that we discussed in the previous section.
Now, the point of gemmlowp's flexible output-pipelines mechanism (see output.md) is to support different quantization paradigms, so we now have to specify which particular flavor of output pipeline corresponds to the particular quantization paradigm that we detailed above in this document.
The specific output pipeline stage implementing the present quantization paradigm, i.e. implementing the precise computation detailed in the previous section (equation (5)), is OutputStageQuantizeDownInt32ByFixedPoint.
Please refer to the comment explaining it in public/output_stages.h.
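For orientation, here is a sketch of how such an output pipeline is typically assembled and passed to GemmWithOutputPipeline, modeled on doc/quantization_example.cc. Include paths, stage and field names may differ across gemmlowp versions, and the matrix-map setup is simplified:

#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"
#include "public/output_stages.h"

// Sketch: multiply quantized uint8 matrices lhs and rhs into a uint8 result,
// using the quantization parameters derived in this document.
void QuantizedGemm(
    gemmlowp::GemmContext* context,
    const gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>& lhs,
    const gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>& rhs,
    gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>* result,
    int lhs_zero_point, int rhs_zero_point, int result_zero_point,
    std::int32_t quantized_multiplier, int right_shift) {
  // Output stage implementing equation (7) with a fixed-point multiplier.
  gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
  quantize_down_stage.result_offset_after_shift = result_zero_point;
  quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
  quantize_down_stage.result_shift = right_shift;
  // Final saturating cast from int32 to uint8.
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline = std::make_tuple(quantize_down_stage, saturating_cast_stage);

  // gemmlowp takes offsets that are added to the matrix entries, so (as in
  // quantization_example.cc) the zero_points are passed negated.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      context, lhs, rhs, result, -lhs_zero_point, -rhs_zero_point, output_pipeline);
}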
1.8. How this differs from the older legacy gemmlowp quantization paradigm
The difference between the older legacy quantization paradigm described in low-precision.md and the newer one described in this document boils down to the difference between the legacy output stage implementing it, OutputStageQuantizeDownInt32ToUint8Scale, and the new output stage implementing the new paradigm, OutputStageQuantizeDownInt32ByFixedPoint.
Please refer to the comments in public/output_stages.h for details about these two output stages and how they differ.
Issues with the old output stage OutputStageQuantizeDownInt32ToUint8Scale are:
- The int32 accumulators (inputs to the output stage) undergo a plain int32 multiplication with an int32 multiplier, which may overflow. By contrast, in the newer OutputStageQuantizeDownInt32ByFixedPoint, this integer multiplication becomes a fixed-point multiplication and cannot overflow.
  - In practice, to limit the risk of overflow, this pushes users to choose smaller values for this integer multiplier, which means limited multiplicative accuracy, which may cause multiplicative bias depending on how it is used.
- Note how the order of multiplying by the multiplier and adding the result_offset are swapped. This reflects a quantization equation of the form (1) above, as opposed to the form (2)/(3) that the new quantization paradigm uses. As a result, it is essentially impossible to guarantee that 0 is an exactly-representable value, which as discussed above is an issue at least in some convolutional neural network applications.
1.9. Example code illustrating the new quantization paradigm
Example code showing how to perform a quantized matrix multiplication in the quantization paradigm discussed here is in doc/quantization_example.cc.
Backlinks
Quantization (Quantization > Gemmlowp Quantization): Gemmlowp Quantization