TFLite Quantization Details
Table of Contents
1. TFLite Quantization Details
tensorflow/tensorflow/lite/tools/optimize/quantize_model.cc::TfLiteStatus QuantizeModel(flatbuffers::FlatBufferBuilder* builder,
1.1. Mul Reference Kernel
Using the simplest case, the Mul reference kernel, as an example, this section walks through the EvalQuantized flow.
1.1.1. Overview
Mul is the a*b=c operation. After quantization, what actually has to be computed is:
Given \(S_a, Q_a, Z_a, S_b, Q_b, Z_b, S_c, Z_c\), compute \(Q_c\) from \(S_a(Q_a-Z_a)*S_b(Q_b-Z_b)=S_c(Q_c-Z_c)\).
tflite quantization guarantees that every \(Z\) is an integer, but \(S\) is a floating-point number.
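Solving for \(Q_c\) gives the form the kernel actually evaluates: \(Q_c = Z_c + \frac{S_a S_b}{S_c}(Q_a-Z_a)(Q_b-Z_b)\), where the constant \(\frac{S_a S_b}{S_c}\) is the only floating-point quantity involved.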
1.1.2. rounding_doubling_high_mul
rounding_doubling_high_mul solves the problem of computing int*float using only integer arithmetic.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <limits>

// "saturating" means that on overflow the maximum value is returned
// instead of a truncated (wrapped-around) value
inline int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<int32_t>::min();
  int64_t a_64(a);
  int64_t b_64(b);
  int64_t ab_64 = a_64 * b_64;
  int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  int32_t ab_x2_high32 = static_cast<int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<int32_t>::max() : ab_x2_high32;
}

int quantized_multiply(int32_t a, float b) {
  // prepare: split b into a normalized fraction q and an exponent (shift)
  int shift = 0;
  double q = std::frexp(b, &shift);
  printf("-- %f %d\n", q, shift);
  auto q_fixed = static_cast<int64_t>(std::round(q * (1ll << 31)));
  printf("-- %lld\n", (long long)q_fixed);
  // eval: fixed-point multiply, then undo the exponent with a shift
  // (this toy example only handles shift < 0, i.e. |b| < 0.5)
  int32_t x = SaturatingRoundingDoublingHighMul(a, (int32_t)q_fixed);
  printf("-- %d\n", x);
  return x >> (-shift);
}

int main(int argc, char *argv[]) {
  int32_t a = 100;
  float b = 0.012;
  printf("%d * %f = %d\n", a, b, quantized_multiply(a, b));
  return 0;
}
-- 0.768000 -6
-- 1649267456
-- 77
100 * 0.012000 = 1
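Reading the output: frexp splits 0.012 into \(0.768 \times 2^{-6}\); 0.768 in Q31 fixed point is 1649267456; SaturatingRoundingDoublingHighMul(100, 1649267456) gives \(\approx 100 \times 0.768 \approx 77\); and \(77 \gg 6 = 1\), which approximates the exact \(100 \times 0.012 = 1.2\).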
Backlinks
Quantization (Quantization > Overview): the scale is normally a floating-point number, but some frameworks such as CMSIS require the scale to have the form \(2^{n}\), which avoids floating-point multiplication and lets a shift replace the integer multiplication. tflite uses a floating-point scale, but it can be converted into an integer multiplication via rounding_doubling_high_mul.
1.1.3. Prepare
tensorflow/tensorflow/lite/kernels/mul.cc::TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node)
mul.cc:Prepare:

// S_a(Q_a-Z_a)*S_b(Q_b-Z_b)=S_c(Q_c-Z_c)
// real_multiplier is (S_a*S_b)/S_c
double real_multiplier =
    input1->params.scale * input2->params.scale / output->params.scale;
// QuantizeMultiplier converts real_multiplier into an integer plus a shift
QuantizeMultiplier(real_multiplier, &data->output_multiplier,
                   &data->output_shift);
1.1.3.1. QuantizeMultiplier
QuantizeMultiplier converts the floating-point real_multiplier (i.e. the scale) into an int/shift pair; for example, if real_multiplier is 0.012, quantized_multiplier becomes 1649267456 and shift becomes -6 (see rounding_doubling_high_mul).
void QuantizeMultiplier(double double_multiplier, int32_t* quantized_multiplier,
                        int* shift) {
  const double q = std::frexp(double_multiplier, shift);
  auto q_fixed = static_cast<int64_t>(TfLiteRound(q * (1ll << 31)));
  *quantized_multiplier = static_cast<int32_t>(q_fixed);
}
Backlinks
Quantization (Quantization > Overview): the `DQ-Q` between two layers is a constant and can be converted ahead of time into an approximate integer operation (e.g. CMSIS's output_shift and tflite's real_multiplier)
output_shift and bias_shift (CMSIS Quantization > output_shift and bias_shift > computing the quantization parameters of a dense layer): when running inference with a tflite quantized model, the Prepare stage likewise uses an output_shift, except that it is called real_multiplier
1.1.4. EvalQuantized
tensorflow/tensorflow/lite/kernels/mul.cc::TfLiteStatus EvalQuantized
template <KernelType kernel_type>
TfLiteStatus EvalQuantized(TfLiteContext* context, TfLiteNode* node,
                           TfLiteMulParams* params, const OpData* data,
                           const TfLiteTensor* input1,
                           const TfLiteTensor* input2, TfLiteTensor* output) {
  tflite::ArithmeticParams op_params;
  op_params.input1_offset = -input1->params.zero_point;
  op_params.input2_offset = -input2->params.zero_point;
  op_params.output_offset = output->params.zero_point;
  op_params.output_multiplier = data->output_multiplier;
  op_params.output_shift = data->output_shift;
#define TF_LITE_MUL(type, opname, dtype)                             \
  type::opname(op_params, GetTensorShape(input1),                    \
               GetTensorData<dtype>(input1), GetTensorShape(input2), \
               GetTensorData<dtype>(input2), GetTensorShape(output), \
               GetTensorData<dtype>(output))
  if (input1->type == kTfLiteInt8) {
    TF_LITE_MUL(reference_integer_ops, Mul, int8_t);
  }
}
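Note that the input offsets are the negated zero points, so the reference kernel below can form \(Q_a-Z_a\) and \(Q_b-Z_b\) by a plain addition, while output_offset stays \(+Z_c\) because it is added back at the end.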
- The reference kernel is fairly inefficient: it computes the product element by element. In fact, even the optimized Mul kernel on x86 uses an elementwise Mul.
- tflite's optimized kernels only bring GEMM into play for ops such as fully_connected and conv:
tensorflow/tensorflow/lite/kernels/internal/optimized/integer_ops/fully_connected.h::inline void FullyConnected
1.1.5. Reference Kernel
template <typename T>
inline void MulElementwise(int size, const ArithmeticParams& params,
                           const T* input1_data, const T* input2_data,
                           T* output_data) {
  for (int i = 0; i < size; ++i) {
    // Q_a-Z_a
    const int32 input1_val = params.input1_offset + input1_data[i];
    // Q_b-Z_b
    const int32 input2_val = params.input2_offset + input2_data[i];
    // Q_c = Z_c + real_multiplier * ((Q_a-Z_a)*(Q_b-Z_b))
    const int32 output =
        params.output_offset +
        MultiplyByQuantizedMultiplier(input1_val * input2_val,
                                      params.output_multiplier,
                                      params.output_shift);
    output_data[i] = static_cast<T>(output);
  }
}
inline int32 MultiplyByQuantizedMultiplier(int32 x, int32 quantized_multiplier,
                                           int shift) {
  using gemmlowp::RoundingDivideByPOT;
  using gemmlowp::SaturatingRoundingDoublingHighMul;
  int right_shift = -shift;
  return RoundingDivideByPOT(
      SaturatingRoundingDoublingHighMul(x, quantized_multiplier), right_shift);
}
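To tie the pieces together, here is a self-contained sketch, not TFLite code: the scales, zero points, input values and the simplified RoundingDivideByPOT are made-up stand-ins. It quantizes two floats, multiplies them using only integer arithmetic as described above, and dequantizes the result for comparison with the float product.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <limits>

inline int32_t SaturatingRoundingDoublingHighMul(int32_t a, int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<int32_t>::min();
  int64_t ab_64 = static_cast<int64_t>(a) * static_cast<int64_t>(b);
  int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  int32_t ab_x2_high32 = static_cast<int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<int32_t>::max() : ab_x2_high32;
}

// simplified stand-in for gemmlowp::RoundingDivideByPOT: divide by
// 2^exponent with round-to-nearest
inline int32_t RoundingDivideByPOT(int32_t x, int exponent) {
  const int32_t mask = static_cast<int32_t>((1ll << exponent) - 1);
  const int32_t remainder = x & mask;
  const int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}

int main() {
  // made-up quantization parameters and inputs
  const float S_a = 0.05f, S_b = 0.02f, S_c = 0.1f;
  const int32_t Z_a = -3, Z_b = 5, Z_c = -8;
  const float a = 1.25f, b = 0.9f;

  // quantize the inputs: Q = round(x / S) + Z
  const int32_t Q_a = static_cast<int32_t>(std::round(a / S_a)) + Z_a;
  const int32_t Q_b = static_cast<int32_t>(std::round(b / S_b)) + Z_b;

  // "Prepare": real_multiplier = S_a*S_b/S_c -> (quantized_multiplier, shift)
  const double real_multiplier = static_cast<double>(S_a) * S_b / S_c;
  int shift = 0;
  const double q = std::frexp(real_multiplier, &shift);
  const int32_t quantized_multiplier =
      static_cast<int32_t>(std::round(q * (1ll << 31)));

  // "Eval": Q_c = Z_c + real_multiplier * (Q_a-Z_a)*(Q_b-Z_b), integers only
  const int32_t acc = (Q_a - Z_a) * (Q_b - Z_b);
  const int32_t Q_c =
      Z_c + RoundingDivideByPOT(
                SaturatingRoundingDoublingHighMul(acc, quantized_multiplier),
                -shift);  // real_multiplier < 1 here, so shift is negative

  printf("float result: %f, dequantized result: %f\n",
         a * b, S_c * (Q_c - Z_c));
  return 0;
}

With these made-up values the dequantized result (1.1) approximates the float product (1.125) up to the output step size S_c = 0.1.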
1.2. FullyConnected Reference Kernel
1.2.1. Quantize
tensorflow/tensorflow/lite/toco/graph_transformations/quantize.cc::bool ChooseQuantizationForOperatorInput
When quantizing FullyConnected, the bias scale is first forced to equal input_scale * weight_scale, so that later \(Q_w*Q_x+Q_b\) can be computed in the same scale.
if (is_bias_vector) {
  // Quantization of bias vector.
  // We need both of the mandatory inputs (input activations and weights) to
  // have been already quantized.
  const auto& input_activations =
      model->GetArray(op.inputs[activations_input_index]);
  const auto& input_weights = model->GetArray(op.inputs[weights_input_index]);
  const auto input_activations_scale =
      input_activations.quantization_params->scale;
  const auto input_weights_scale = input_weights.quantization_params->scale;
  quantization_params->scale = input_activations_scale * input_weights_scale;
  quantization_params->zero_point = 0;
  *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kInt32);
  return true;
}
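With this scale and a zero point of 0, the bias values are simply quantized as \(Q_{bias} = \mathrm{round}(bias / (S_x S_w))\) and stored as int32, so they land on the same grid as the \(Q_w*Q_x\) accumulator.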
1.2.2. Prepare
tensorflow/tensorflow/lite/kernels/fully_connected.cc::TfLiteStatus PrepareImpl(TfLiteContext* context, TfLiteNode* node)
Prepare is essentially the same as Mul's Prepare: it also precomputes real_multiplier = input_scale * weight_scale / output_scale and feeds it through QuantizeMultiplier. Note that the bias scale needs no separate treatment, because quantize_model already guarantees that it equals input_scale * weight_scale, i.e. the same scale as the \(Q_w*Q_x\) accumulator.
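Concretely, with \(S_{bias} = S_x S_w\) and \(Z_{bias} = 0\), the dequantized output satisfies \(S_{out}(Q_{out}-Z_{out}) = \sum_d S_x(Q_x-Z_x)S_w(Q_w-Z_w) + S_{bias}Q_{bias}\), so \(Q_{out} = Z_{out} + \frac{S_x S_w}{S_{out}}\big(\sum_d (Q_x-Z_x)(Q_w-Z_w) + Q_{bias}\big)\), which is exactly what the reference kernel in section 1.2.4 computes.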
1.2.3. EvalQuantized
tensorflow/tensorflow/lite/kernels/fully_connected.cc::op_params.input_offset
EvalQuantized is also essentially the same as Mul's EvalQuantized.
1.2.4. Reference Kernel
inline void FullyConnected(
    const FullyConnectedParams& params, const RuntimeShape& input_shape,
    const int8_t* input_data, const RuntimeShape& filter_shape,
    const int8_t* filter_data, const RuntimeShape& bias_shape,
    const int32* bias_data, const RuntimeShape& output_shape,
    int8_t* output_data) {
  const int32 input_offset = params.input_offset;
  const int32 filter_offset = params.weights_offset;
  const int32 output_offset = params.output_offset;
  const int32 output_multiplier = params.output_multiplier;
  const int output_shift = params.output_shift;
  const int filter_dim_count = filter_shape.DimensionsCount();
  const int batches = output_shape.Dims(0);
  const int output_depth = output_shape.Dims(1);
  const int accum_depth = filter_shape.Dims(filter_dim_count - 1);
  for (int b = 0; b < batches; ++b) {
    for (int out_c = 0; out_c < output_depth; ++out_c) {
      int32 acc = 0;
      for (int d = 0; d < accum_depth; ++d) {
        int32 input_val = input_data[b * accum_depth + d];
        int32 filter_val = filter_data[out_c * accum_depth + d];
        acc += (filter_val + filter_offset) * (input_val + input_offset);
      }
      if (bias_data) {
        // the bias can be added directly, because it has the same scale as w*x
        acc += bias_data[out_c];
      }
      acc = MultiplyByQuantizedMultiplier(acc, output_multiplier, output_shift);
      acc += output_offset;
      output_data[out_c + output_depth * b] = static_cast<int8_t>(acc);
    }
  }
}
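One detail the excerpt leaves out: the kernel in the repository also clamps acc to the [output_activation_min, output_activation_max] range before the final static_cast<int8_t>, so the narrowing conversion cannot overflow.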
1.3. BatchNorm quantization error
After training for many epochs, the quantization error turns out to be large. Inspecting the tflite quantization data shows that the cause is the Mul op corresponding to batchnorm: the maximum value of its scale is very large, which produces a large quantization error.
By the batchnorm formula, scale = gamma / (np.sqrt(var + eps)). The logs show that as training epochs increase, min(var) and min(mean) both become very small (e.g. 1e-3…), which makes the scale large.
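To put illustrative numbers on this: with \(\gamma = 1\), \(\epsilon = 10^{-3}\) and \(\mathrm{var} = 10^{-5}\), the folded scale is \(1/\sqrt{10^{-5}+10^{-3}} \approx 31\), whereas a channel with \(\mathrm{var} \approx 1\) has a scale of about 1, so a single near-dead channel stretches the value range that the quantization of this op has to cover.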
var and mean being close to 0 means that some hidden-layer outputs are zero, which has two possible causes:
- the corresponding weights in the hidden layer are close to zero:
\(\left\{\begin{array}{c}w*x_1+b=0\\w*x_2+b=0\\x_1 \ne x_2\end{array}\right. \implies w=0\)
- or the hidden-layer outputs are zeroed out by relu (Gradient Starvation)
Possible fixes:
- shrink the layers before batchnorm to avoid weights that are close to zero
- use a relu variant such as leaky relu to avoid dead neurons caused by relu
- put relu after batchnorm instead of before it
In this particular case, the root cause was that relu made some of the dense layer's outputs zero for every sample, i.e. some neurons had died.
The batchnorm scales in the network can be inspected with the following script:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# 2021-08-13 09:58
import numpy as np

from models import deep_cnn
import config


def check_model(path):
    print(f"------{path}------")
    model = deep_cnn()
    model.load_weights(path)
    bn = [x for x in model.layers if x.name.startswith("batch")]
    for layer in bn:
        print(layer.name)
        # y = gamma * (x - mean) / sqrt(var + eps) + beta
        gamma, beta, mean, var = layer.get_weights()
        eps = 0.001
        scale = gamma / (np.sqrt(var + eps))
        # folded bias: beta - gamma * mean / sqrt(var + eps)
        bias = beta - scale * mean
        print(f"---{layer.name}---")
        print(np.min(scale), np.max(scale))
        print(np.min(bias), np.max(bias))
        index = np.argmax(scale)
        print(index, gamma[index], var[index], mean[index])


check_model(config.SAVED_MODEL_PATH)
Backlinks
Batch Normalization (Batch Normalization > batchnorm->relu vs. relu->batchnorm > advantages of batchnorm->relu): on the other hand, putting relu before batchnorm may run into the problem where BatchNorm causes a large quantization error
Backlinks
Quantization (Quantization > TFLite Quantization Details): TFLite Quantization Details