[Model Quantization] Neural Network Quantization: Fundamentals and Code Study Notes

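1 Introduction to Quantization

Quantization is one of the most effective ways to reduce the computation time and energy consumption of neural networks. In neural network quantization, the weight and activation tensors are stored at a lower bit precision than the 16-bit or 32-bit precision typically used during training. Going from 32-bit to 8-bit reduces the memory overhead of storing tensors by a factor of 4, while the computational cost of matrix multiplication drops quadratically, by a factor of 16.
Neural networks have been shown to be robust to quantization, meaning they can be quantized to lower bit-widths with a relatively small impact on accuracy. Quantization is not free, however. Low bit-width quantization introduces noise into the network, which can reduce accuracy. Some networks tolerate this noise well, while others need extra work to reap the benefits of quantization.

In practice, quantization maps float32 (32-bit floating-point) parameters to a lower precision. The change of precision is not a plain type cast; it establishes a mapping between data of different precisions, most commonly between fixed-point and floating-point values, so that a large gain is obtained at the cost of a small loss in accuracy.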

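2 Uniform Affine Quantization

Uniform affine quantization, also known as asymmetric quantization, is defined by three parameters:
$s$: the scale factor, or quantization step size; a floating-point number
$z$: the zero-point; an integer that ensures real zero is quantized without error, which matters for ReLU and zero-padding
$b$: the bit-width; an integer such as 2, 4, 6, or 8
$s$ and $z$ map floating-point numbers to integers, and $b$ determines the integer range.

1) Convert the real floating-point input $\mathbf x$ to an unsigned integer: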
$\mathbf{x}_{int} = \mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil+z; 0, 2^b-1)$

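Here $\lfloor\cdot\rceil$ denotes round-to-nearest, and the clamp function is defined as:
$\mathrm{clamp}(x; a, c) = \begin{cases} a, x < a, \\ x, a \leq x\leq c,\\ c, x>c. \end{cases}$

2) De-quantization approximates the real input $\mathbf x$:
$\mathbf x\approx \mathbf{\hat x} =s(\mathbf x_{int} -z)$

Combining steps 1) and 2) gives the general definition of the quantization function:
$\mathbf{\hat x}=q(\mathbf x; s, z, b)=s(\mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil+z; 0, 2^b-1)-z)$

The quantization function thus contains both the float-to-integer conversion of step 1) and the de-quantization back to floating point of step 2). This combined process is usually called fake quantization.
To understand fake quantization: the floating-point input is quantized to integers and then de-quantized back to floating point, which simulates the quantization error. During backpropagation, the straight-through estimator (STE) passes the gradient back to the preceding layers.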
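The two steps and their composition can be sketched in plain Python for a single scalar. This is a minimal illustration, not a reference implementation: the function names and the values of `s`, `z`, and `b` are assumptions chosen for the example, and Python's built-in `round` (which rounds exact ties to even) stands in for $\lfloor\cdot\rceil$.

```python
def quantize(x, s, z, b):
    """Map a float to an unsigned integer on the grid [0, 2^b - 1]."""
    q = round(x / s) + z                 # round-to-nearest, then shift by the zero-point
    return max(0, min(2 ** b - 1, q))    # clamp to the representable range


def dequantize(x_int, s, z):
    """Map an integer back to an approximate float."""
    return s * (x_int - z)


def fake_quantize(x, s, z, b):
    """Quantize then de-quantize: float in, float out, but on the quantization grid."""
    return dequantize(quantize(x, s, z, b), s, z)


s, z, b = 0.1, 10, 4                     # illustrative values: the grid covers [-1.0, 0.5]
for x in (0.23, 1.0, -2.0):
    print(x, quantize(x, s, z, b), fake_quantize(x, s, z, b))
```

With these values, 0.23 is only rounded (to about 0.2), while 1.0 and -2.0 fall outside the range and are clipped to about 0.5 and -1.0.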

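The formulas above give rise to two kinds of error:
1) Clipping error: a floating-point value $x$ outside the quantization range is clipped, which introduces error
2) Rounding error: the round-to-nearest operation $\lfloor \cdot\rceil$ introduces an error within $[-\frac{1}{2}, \frac{1}{2}]$ on the integer grid
Trading off these two errors requires a suitable choice of $s$ and $z$, which in turn depend on the quantization range and precision.

Following the de-quantization step, let the minimum and maximum values on the integer grid be $q_n=q_{min}/s$ and $q_p=q_{max}/s$, so that the quantized (floating-point) values span $[q_{min}, q_{max}]$, where $q_{min}=sq_n=s(0-z)=-sz$ and $q_{max}=sq_p=s(2^b-1-z)$. Any $\mathbf x$ outside this range is clipped, producing clipping error. Increasing $s$ widens the range and reduces the clipping error, but it also increases the rounding error, because in floating-point terms the rounding error lies in $[-\frac{1}{2}s, \frac{1}{2}s]$.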

How is the scale factor $s$ computed?
$s=\frac{q_{max}-q_{min}}{2^b-1}.$
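As a worked example, the sketch below derives $s$ from a floating-point range and then $z$ from the relation $q_{min}=-sz$. The helper name `compute_qparams` and the range $[-1, 3]$ are assumptions for illustration; the code also assumes the range is extended to contain 0 so that real zero stays exactly representable.

```python
def compute_qparams(x_min, x_max, b):
    """Derive the scale s and zero-point z from a float range [x_min, x_max]."""
    # Assumption: extend the range to contain 0 so that real zero is representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    s = (x_max - x_min) / (2 ** b - 1)   # s = (q_max - q_min) / (2^b - 1)
    z = round(-x_min / s)                # from q_min = -s * z, rounded to an integer
    return s, z


s, z = compute_qparams(-1.0, 3.0, b=8)
x_int_zero = max(0, min(2 ** 8 - 1, round(0.0 / s) + z))   # quantize real zero
print(s, z, s * (x_int_zero - z))        # de-quantizing gives exactly 0.0
```

Because $z$ must be an integer, the effective $q_{min}=-sz$ can shift slightly from the requested minimum; here $z=64$ gives a lower bound of about -1.004 instead of -1.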

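2.1 Symmetric Uniform Quantization

Symmetric uniform quantization is a simplified version of the asymmetric scheme above that fixes the zero-point to $z = 0$. The missing offset, however, restricts the mapping between the integer and floating-point domains.

De-quantization approximates the real input $\mathbf x$:
$\mathbf x\approx \mathbf{\hat x} =s\mathbf x_{int}$

Convert the real floating-point input $\mathbf x$ to an unsigned integer:
$\mathbf{x}_{int} = \mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil; 0, 2^b-1)$

Convert the real floating-point input $\mathbf x$ to a signed integer:
$\mathbf{x}_{int} = \mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil; -2^{b-1}, 2^{b-1}-1)$

[Figure: the integer quantization grid and the floating-point grid]

Above the axis (blue) is the integer quantization grid; below it (black) is the floating-point grid. The figure makes it clear that the scale factor $s$ is the quantization step size, and that $s\mathbf x_{int}$ is the de-quantized approximation of the real floating-point value.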
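A small sketch of the signed and unsigned symmetric variants, assuming the usual $b$-bit signed range $[-2^{b-1}, 2^{b-1}-1]$. The function name and the step size `s = 0.01` are illustrative choices.

```python
def symmetric_quantize(x, s, b, signed=True):
    """Symmetric quantization: the zero-point is fixed at 0."""
    if signed:
        lo, hi = -(2 ** (b - 1)), 2 ** (b - 1) - 1   # [-128, 127] for b = 8
    else:
        lo, hi = 0, 2 ** b - 1                       # [0, 255] for b = 8
    return max(lo, min(hi, round(x / s)))


s = 0.01
signed_ints = [symmetric_quantize(x, s, 8) for x in (1.0, 2.0, -2.0)]
unsigned_int = symmetric_quantize(-0.5, s, 8, signed=False)
print(signed_ints, unsigned_int, s * signed_ints[0])
```

The unsigned variant cannot represent negative inputs at all (they clip to 0), which is why it is typically paired with non-negative tensors such as post-ReLU activations.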

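2.2 Power-of-Two Quantization

Power-of-two quantization is a special case of symmetric quantization in which the scale factor is restricted to a power of two, $s=2^{-k}$. This is hardware-efficient, because scaling by $s$ reduces to a simple bit-shift.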
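The bit-shift equivalence can be checked with integers directly. The sketch below rounds an arbitrary scale to the nearest power of two in the log domain and then compares a right shift against a multiplication by $s$; the accumulator value 1001 is a hypothetical example.

```python
import math


def nearest_power_of_two(scale):
    """Round a positive scale to the nearest power of two (in the log2 domain)."""
    return 2.0 ** round(math.log2(scale))


s = nearest_power_of_two(0.3)     # log2(0.3) is about -1.74, so s = 2^-2
k = -round(math.log2(s))          # shift amount
acc = 1001                        # hypothetical integer accumulator value
print(s, k, acc >> k, math.floor(acc * s))
```

Note that an arithmetic right shift floors the result (also for negative values), so hardware that needs round-to-nearest adds a rounding offset before shifting.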

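2.3 Quantization Granularity

1) Per-tensor: the most common choice in neural networks and simple to implement in hardware, since all accumulated results share the same scale factor $s_ws_x$
2) Per-channel: finer granularity to improve model accuracy, for example a separate quantizer for each output channel of the weights
3) Per-group (group granularity)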
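To see why finer granularity can help, the sketch below compares a single per-tensor scale against per-channel scales on a tiny weight matrix whose rows differ in magnitude. The matrix, the helper names, and the abs-max rule for choosing the scale are assumptions made for this illustration.

```python
def abs_max_scale(values, b=8):
    """Symmetric scale so that the largest magnitude maps to 2^(b-1) - 1."""
    return max(abs(v) for v in values) / (2 ** (b - 1) - 1)


def fake_quant(x, s, b=8):
    """Symmetric signed fake quantization of one scalar."""
    q = max(-(2 ** (b - 1)), min(2 ** (b - 1) - 1, round(x / s)))
    return s * q


# Hypothetical 2x2 weight matrix; each row is one output channel.
W = [[0.5, -1.0],
     [0.04, 0.1]]

s_tensor = abs_max_scale([w for row in W for w in row])   # one scale for everything
s_channel = [abs_max_scale(row) for row in W]             # one scale per output channel

err_tensor = abs(fake_quant(W[1][0], s_tensor) - W[1][0])
err_channel = abs(fake_quant(W[1][0], s_channel[1]) - W[1][0])
print(s_tensor, s_channel, err_tensor, err_channel)
```

The small-magnitude second row is quantized more accurately with its own scale, because the per-tensor step size is dictated by the large values in the first row.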

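3 Quantization Simulation / Fake Quantization

Quantization simulation: to test how well a neural network would run on a quantized device, we often simulate the quantized behavior on the same general-purpose hardware used to train the network.
Goal: approximate fixed-point operations using floating-point hardware.
Advantage: such simulation is significantly easier to implement than running experiments on actual quantized hardware or with quantized kernels.

[Figure: (a) quantized inference on device; (b) simulated quantization on general-purpose hardware]

(a) During on-device inference, all inputs to the hardware (biases, weights, and input activations) are in fixed-point format
(b) When we simulate quantization with a general-purpose deep learning framework and general-purpose hardware, however, these quantities are in floating-point format. This is why quantizer blocks are introduced into the compute graph to induce the quantization effects

Points worth noting:
1) Each quantizer is defined by a set of quantization parameters (scale factor, zero-point, bit-width)
2) Both the input and the output of a quantizer are in floating-point format, but the output lies on the quantization grid
3) Each quantizer computes $\mathbf{\hat x}=q(\mathbf x; s, z, b)=s(\mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil+z; 0, 2^b-1)-z)$, which includes the de-quantization step
4) Simulated quantization still computes in floating point; what it simulates is the (clipping and rounding) error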
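The correspondence between the on-device view and the simulated view can be illustrated with a single dot product. The sketch assumes symmetric per-tensor quantization (zero-point 0) and made-up weights, activations, and scales; it shows that an integer multiply-accumulate rescaled by $s_ws_x$ matches the floating-point computation on de-quantized values.

```python
def to_int(x, s, b=8):
    """Symmetric signed quantization to an integer (zero-point 0)."""
    return max(-(2 ** (b - 1)), min(2 ** (b - 1) - 1, round(x / s)))


s_w, s_x = 0.02, 0.05                        # illustrative per-tensor scales
w = [0.303, -0.768, 0.121]                   # hypothetical weights
x = [1.21, 0.44, -2.11]                      # hypothetical activations

w_int = [to_int(v, s_w) for v in w]
x_int = [to_int(v, s_x) for v in x]

# (a) On-device view: integer multiply-accumulate, rescaled once by s_w * s_x.
acc_int = sum(wi * xi for wi, xi in zip(w_int, x_int))
y_device = s_w * s_x * acc_int

# (b) Simulation view: de-quantized floats and ordinary floating-point arithmetic.
y_sim = sum((s_w * wi) * (s_x * xi) for wi, xi in zip(w_int, x_int))

y_float = sum(wi * xi for wi, xi in zip(w, x))   # unquantized reference
print(acc_int, y_device, y_sim, y_float)
```

Both views agree up to floating-point round-off, and both differ from the unquantized reference by the accumulated rounding error.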

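4 STE-Based Backpropagation

A serious optimization problem: the gradient of the round function in the quantization formula is zero almost everywhere and undefined at the jump points, which makes gradient-based training impossible. One solution is the straight-through estimator (STE), which approximates the gradient of the round function as 1: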
$\frac{\partial \lfloor y\rceil}{\partial y}=1$

The gradient of the quantizer can now be computed. Differentiating with respect to the input $\mathbf x$ (symmetric case, $z=0$):
$\frac{\partial\mathbf{\hat x}}{\partial\mathbf x}=\frac{\partial q(\mathbf x)}{\partial\mathbf x}=\frac{\partial\, s\,\mathrm{clamp}(\lfloor\frac{\mathbf x}{s}\rceil; q_n, q_p)}{\partial\mathbf x}$
$=\begin{cases} s\frac{\partial q_n}{\partial \mathbf x}=0, \mathbf x < q_{min}, \\ s\frac{\partial \lfloor \mathbf x/s\rceil}{\partial \mathbf x}=s\frac{\partial \lfloor \mathbf x/s\rceil}{\partial (\mathbf x/s)}\frac{\partial (\mathbf x/s)}{\partial \mathbf x}=s\cdot 1\cdot \frac{1}{s}=1, q_{min} \leq \mathbf x\leq q_{max},\\ s\frac{\partial q_p}{\partial \mathbf x}=0, \mathbf x>q_{max}. \end{cases}$
$=\begin{cases} 0, \mathbf x < q_{min}, \\ 1, q_{min} \leq \mathbf x\leq q_{max},\\ 0, \mathbf x>q_{max}. \end{cases}$
In other words, under the STE the gradient of the quantized value with respect to the real floating-point value is 1 when the input $\mathbf x$ lies inside the quantization range, and 0 otherwise.
The derivation of the gradient with respect to $s$ is given in the LSQ section below.
The figure below illustrates STE-based backpropagation, in which the gradient computation effectively skips the quantizer.
[Figure: backward pass through the quantizer using the STE]
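The contrast between the true gradient and the STE gradient can be made concrete with a finite difference. The sketch below uses a symmetric fake quantizer with illustrative parameters; `ste_grad` is a hypothetical helper that simply encodes the case analysis of the STE gradient with respect to the input.

```python
def fake_quant(x, s, qn, qp):
    """Symmetric fake quantization: s * clamp(round(x / s), qn, qp)."""
    return s * max(qn, min(qp, round(x / s)))


def ste_grad(x, s, qn, qp):
    """STE gradient of fake_quant w.r.t. x: 1 inside the range, 0 outside."""
    return 1.0 if qn * s <= x <= qp * s else 0.0


s, qn, qp = 0.1, -8, 7
x, eps = 0.23, 1e-4

# Central finite difference: the staircase is flat around x, so the true slope is 0.
numeric = (fake_quant(x + eps, s, qn, qp) - fake_quant(x - eps, s, qn, qp)) / (2 * eps)
print(numeric, ste_grad(x, s, qn, qp), ste_grad(1.5, s, qn, qp))
```

The numerical slope is 0 because the quantizer is piecewise constant, whereas the STE reports 1 inside the range and 0 for the clipped input 1.5, which is what lets gradients reach earlier layers.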

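STE in binary neural networks
Put simply, the idea above is to treat the round function $\lfloor\cdot\rceil$ as the identity during backpropagation (that is, to drop the round). The STE for binary neural networks (BNNs) works in exactly the same way.
Forward pass: $z_b = \mathrm{sign}(z)$
Backward pass: $\frac{\partial l}{\partial z} = \frac{\partial l}{\partial z_b}$
That is, during backpropagation $z$ is used in place of $\mathrm{sign}(z)$ when computing the gradient.
Put differently:
$\frac{\partial z_b}{\partial z}=\frac{\partial z}{\partial z}=1$

This approach comes from BinaryConnect: Training Deep Neural Networks with binary weights during propagations (NeurIPS 2015), by Bengio and co-authors.

Basic STE implementations
Take the round operation as an example.
The first way detaches the rounding residual from the compute graph with detach():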

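import torch


def round_ste(x: torch.Tensor) -> torch.Tensor:
    """
    Implement the straight-through estimator for the rounding operation.
    """
    # Forward: returns x.round(). Backward: the detached residual carries no
    # gradient, so the gradient w.r.t. x is 1 (identity).
    return (x.round() - x).detach() + x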

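Reference: RAPQ

The second way defines the gradient by hand-writing backward, which returns the incoming gradient unchanged (grad_input = grad_output):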

class STE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, bit):
        if bit == 0:
            # No quantization
            act = x
        else:
            s = torch.max(torch.abs(x))
            if s == 0:
                act = x * 0
            else:
                step = 2 ** (bit) - 1
                scale = s / step
                # Optionally round the scale to the nearest power of two
                p2_round_scale = True
                if p2_round_scale:
                    scale = 2 ** (torch.log2(scale).round())
                # r = torch.round(torch.abs(x) * step / s) / step
                # act =  s * r * torch.sign(x)
                # Quantize the magnitude, then restore the sign
                act = (torch.round(torch.abs(x) / scale)) * scale * torch.sign(x)
        return act

    @staticmethod
    def backward(ctx, g):
        # Straight-through: pass the gradient to x unchanged; bit gets no gradient
        return g, None

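Reference: BSQ and ShiftAddNet

5 Classic Quantization Work

1 Learned Step Size Quantization (ICLR 2020)

As the name suggests, LSQ introduces a learnable (trainable) scale factor $s$ into the fake quantization described above.
Let the minimum and maximum values of the clamp on the integer grid be $q_n=q_{min}/s$ and $q_p=q_{max}/s$.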

$\hat x=s(\mathrm{clamp}(\lfloor\frac{\mathbf{x}}{s}\rceil; q_n, q_p))\\~~~~=\begin{cases} sq_n, \frac{\mathbf{x}}{s} < q_n, \\ s\lfloor\frac{\mathbf{x}}{s}\rceil, q_n \leq \frac{\mathbf{x}}{s}\leq q_p,\\ sq_p, \frac{\mathbf{x}}{s}>q_p. \end{cases}$

Differentiating $\mathbf{\hat x}$ with respect to $s$ gives:
$\frac{\partial\mathbf{\hat x}}{\partial s}=\begin{cases} q_n, \frac{\mathbf{x}}{s} < q_n, \\ \lfloor\frac{\mathbf{x}}{s}\rceil + s\frac{\partial\lfloor\frac{\mathbf{x}}{s}\rceil}{\partial s}, q_n \leq \frac{\mathbf{x}}{s}\leq q_p,\\ q_p, \frac{\mathbf{x}}{s}>q_p. \end{cases}$
Here $q_n$, $q_p$, and $\lfloor\frac{\mathbf{x}}{s}\rceil$ are directly available, but $s\frac{\partial\lfloor\frac{\mathbf{x}}{s}\rceil}{\partial s}$ is harder to compute.

Using the STE, the gradient of the round function is approximated by a pass-through:
$s\frac{\partial\lfloor\frac{\mathbf{x}}{s}\rceil}{\partial s}=s\frac{\partial\frac{\mathbf{x}}{s}}{\partial s}=-s\frac{\mathbf x}{s^2}=-\frac{\mathbf x}{s}$

This yields the derivative given in the LSQ paper:
$\frac{\partial\mathbf{\hat x}}{\partial s}=\begin{cases} q_n, \frac{\mathbf{x}}{s} < q_n, \\ \lfloor\frac{\mathbf{x}}{s}\rceil - \frac{\mathbf x}{s}, q_n \leq \frac{\mathbf{x}}{s}\leq q_p,\\ q_p, \frac{\mathbf{x}}{s}>q_p. \end{cases}$
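The three cases translate directly into a scalar function. This is a minimal sketch with a hypothetical name, `lsq_grad_s`, and an illustrative 4-bit signed grid; it evaluates the derivative for one rounded and two clipped inputs.

```python
def lsq_grad_s(x, s, qn, qp):
    """d(x_hat)/d(s) for x_hat = s * clamp(round(x / s), qn, qp) under the STE."""
    v = x / s
    if v < qn:
        return float(qn)          # clipped at the lower bound
    if v > qp:
        return float(qp)          # clipped at the upper bound
    return round(v) - v           # inside the range: the rounding residual


s, qn, qp = 0.1, -8, 7            # 4-bit signed grid with an illustrative step size
grads = [lsq_grad_s(x, s, qn, qp) for x in (0.23, 1.0, -1.0)]
print(grads)
```

Inside the range the gradient is bounded by 0.5 in magnitude, while clipped values contribute the much larger $q_n$ or $q_p$, so clipped inputs pull on the step size more strongly.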

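In LSQ, the weights and activations of each layer have their own $s$, initialized to $\frac{2\langle| \mathbf x|\rangle}{\sqrt{q_p}}$, where $\langle|\mathbf x|\rangle$ is the mean absolute value of the tensor.

The gradient of $s$ must also be balanced against the gradient of the model weights, so that the two do not differ too much. LSQ defines the following ratio, which should stay close to 1 (with $l$ the loss):
$r=\frac{\nabla_s l}{s}/\frac{||\nabla_w l||}{||w||}\rightarrow1$ .
To keep training stable, LSQ multiplies the gradient of $s$ by a gradient scale $g$. For weights, $g=1/\sqrt{n_wq_p}$; for activations, $g=1/\sqrt{n_fq_p}$. Here $n_w$ is the number of weights in a layer and $n_f$ is the number of features in a layer.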
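Both formulas are easy to evaluate by hand. The sketch below does so for a hypothetical four-element weight tensor on a 4-bit signed grid; the function names are illustrative, not taken from any LSQ code base.

```python
import math


def lsq_init_step(values, qp):
    """Initial step size: 2 * mean(|x|) / sqrt(qp)."""
    mean_abs = sum(abs(v) for v in values) / len(values)
    return 2.0 * mean_abs / math.sqrt(qp)


def lsq_grad_scale(num_elements, qp):
    """Gradient scale g = 1 / sqrt(N * qp)."""
    return 1.0 / math.sqrt(num_elements * qp)


weights = [0.5, -1.0, 0.25, -0.25]   # hypothetical layer weights
qp = 2 ** (4 - 1) - 1                # 4-bit signed grid: qp = 7
s_init = lsq_init_step(weights, qp)
g = lsq_grad_scale(len(weights), qp)
print(s_init, g)
```

For real layers $N$ is in the thousands or millions, so $g$ becomes very small, which damps the summed gradient that the step size receives from every element.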

Code implementation
Reference: the LSQuantization re-implementation

import torch
import torch.nn.functional as F
import math
from torch.autograd import Variable


class FunLSQ(torch.autograd.Function):
    @staticmethod
    def forward(ctx, weight, alpha, g, qn, qp):
        assert alpha > 0, 'alpha = {}'.format(alpha)
        ctx.save_for_backward(weight, alpha)
        ctx.other = g, qn, qp
        q_w = (weight / alpha).round().clamp(qn, qp)  # round + clamp: FP -> integer grid
        w_q = q_w * alpha  # multiply by the scale to de-quantize back to FP
        return w_q

    @staticmethod
    def backward(ctx, grad_weight):
        weight, alpha = ctx.saved_tensors
        g, qn, qp = ctx.other
        q_w = weight / alpha
        # Masks for the three cases of the LSQ gradient
        indicate_small = (q_w < qn).float()
        indicate_big = (q_w > qp).float()
        indicate_middle = 1.0 - indicate_small - indicate_big
        # Gradient w.r.t. the step size: qn or qp when clipped, round(v) - v otherwise
        grad_alpha = ((indicate_small * qn + indicate_big * qp + indicate_middle * (
                -q_w + q_w.round())) * grad_weight * g).sum().unsqueeze(dim=0)
        # Gradient w.r.t. the weights: STE, passed through only inside the range
        grad_weight = indicate_middle * grad_weight
        return grad_weight, grad_alpha, None, None, None


nbits = 4
qn = -2 ** (nbits - 1)
qp = 2 ** (nbits - 1) - 1
g = 1.0 / 2  # gradient scale (constant here; see the formula for g above)

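2 LSQ+: Improving Low-Bit Quantization Through Learnable Offsets and Better Initialization (CVPR 2020)

LSQ+ is very similar to LSQ, so the two are covered together. Building on LSQ, LSQ+ introduces a learnable offset $\beta$, which plays the role of the zero-point $z$. It is defined as follows:
$\mathbf x_{int}=\mathrm{clamp}(\lfloor\frac{\mathbf{x}-\beta}{s}\rceil; q_n, q_p)$
$\mathbf{\hat x}=s\mathbf x_{int}+\beta$
$s$ and $\beta$ are then optimized through their partial derivatives, in the same way as in LSQ.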
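The forward pass of this quantizer can be sketched for a scalar as follows. Only the forward computation is shown; the values of `s`, `beta`, and the unsigned 4-bit grid are assumptions for illustration.

```python
def lsq_plus_fake_quant(x, s, beta, qn, qp):
    """x_hat = s * clamp(round((x - beta) / s), qn, qp) + beta."""
    x_int = max(qn, min(qp, round((x - beta) / s)))
    return s * x_int + beta


s, beta, qn, qp = 0.1, 0.2, 0, 15   # illustrative values, unsigned 4-bit grid
outputs = [lsq_plus_fake_quant(x, s, beta, qn, qp) for x in (0.53, -0.5, 5.0)]
print(outputs)
```

With these values the representable range is $[\beta, \beta + s q_p] = [0.2, 1.7]$: the offset shifts the whole grid, so an asymmetric activation distribution can be covered without wasting integer levels.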

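3 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

This is one of the very early papers to bring binary (1-bit) representations into neural networks. It proposes two approximations:

1) Binary-Weight-Networks: only the weights are 1-bit

For an input $\mathbf i$, the real-valued filter $\mathbf w$ is approximated by a binary filter $\mathbf b\in \{+1, -1\}$ and a scale factor $\alpha$: $\mathbf w\approx \alpha \mathbf b$. The convolution can then be approximated as:
$\mathbf i*\mathbf w\approx (\mathbf i\oplus \mathbf b)\alpha$
where $\oplus$ denotes a convolution without any multiplications.
How are the binary weights optimized? The goal is to find the best estimate $\mathbf w\approx\alpha \mathbf b$ by solving the following optimization problem:
$j(\mathbf b, \alpha)=||\mathbf w-\alpha \mathbf b||^2~~~~\alpha^*, \mathbf b^*=\mathrm{argmin}_{\alpha, \mathbf b}\,j(\mathbf b, \alpha)$
Expanding gives: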
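$j(\mathbf b, \alpha)=\alpha^2\mathbf b^\top \mathbf b-2\alpha\mathbf w^\top \mathbf b+\mathbf w^\top \mathbf w$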

Here $\mathbf b^\top \mathbf b$ and $\mathbf w^\top \mathbf w$ are constants, so the optimization reduces to the second term, $\mathbf w^\top \mathbf b$:

$\mathbf b^*=\mathrm{argmax}_{\mathbf b}\,\mathbf w^\top \mathbf b~~~~\mathrm{s.t.}~\mathbf b\in \{+1, -1\}$

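This problem is solved by setting $\mathbf b=+1$ where $\mathbf w\geq 0$ and $\mathbf b=-1$ where $\mathbf w< 0$: every term of $\mathbf w^\top \mathbf b$ then equals $|\mathbf w_i|$, so the sum attains its maximum, $\sum |\mathbf w_i|$. Hence $\mathbf b^*=\mathrm{sign}(\mathbf w)$.
Next, to find the optimal scale factor $\alpha$, take the partial derivative of $j$ with respect to $\alpha$: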
$\frac{\partial j}{\partial \alpha}=2\alpha\mathbf b^\top\mathbf b-2\mathbf w^\top \mathbf b$

Setting the derivative to zero gives:
$\alpha^*=\frac{\mathbf w^\top \mathbf b}{\mathbf b^\top \mathbf b}=\frac{\mathbf w^\top \mathbf b}{n}$

where $n=\mathbf b^\top \mathbf b$. Substituting $\mathbf b^*$ for $\mathbf b$:
$\alpha^*=\frac{\mathbf w^\top \mathbf b}{n}=\frac{\mathbf w^\top \mathrm{sign}(\mathbf w)}{n}=\frac{\sum |\mathbf w|}{n}=\frac{1}{n}||\mathbf w||_1$

where $||\cdot||_1$ denotes the $\ell_1$-norm, the sum of the absolute values of all elements.

Summary
The optimal estimate of the binary weights (filter) is the sign of the weights, and the optimal estimate of the scale factor is the mean absolute value of the weights.
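The closed-form solution can be checked numerically on a small example. The sketch below binarizes a hypothetical four-element filter and evaluates the objective $j$ at $\alpha^*$ and at two nearby values; the function names are illustrative.

```python
def binarize(weights):
    """Return (b, alpha) with b = sign(w) and alpha = mean(|w|)."""
    b = [1 if w >= 0 else -1 for w in weights]
    alpha = sum(abs(w) for w in weights) / len(weights)
    return b, alpha


def objective(weights, b, alpha):
    """j(b, alpha) = ||w - alpha * b||^2."""
    return sum((w - alpha * bi) ** 2 for w, bi in zip(weights, b))


w = [0.5, -1.0, 0.25, -0.25]         # hypothetical filter weights
b, alpha = binarize(w)
j_opt = objective(w, b, alpha)
j_lower = objective(w, b, alpha - 0.1)
j_higher = objective(w, b, alpha + 0.1)
print(b, alpha, j_opt, j_lower, j_higher)
```

Moving $\alpha$ away from the mean absolute value in either direction increases the objective, consistent with $j$ being a convex quadratic in $\alpha$.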


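Training procedure
Note that backpropagation computes gradients with the approximated weights $\tilde w$, whereas the weights that actually get updated are the real-valued, high-precision weights $w$.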

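2) XNOR-Networks: both the weights and the activations are 1-bit, so the multiplications reduce to XNOR and bit-counting operations

Binary dot product
$\mathbf x^\top \mathbf w\approx \beta \mathbf h^\top \alpha \mathbf b$, where $\mathbf h, \mathbf b\in \{-1, +1\}$ and $\beta, \alpha\in\mathbb R^+$. The optimization objective is: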
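$\alpha^*, \mathbf b^*, \beta^*, \mathbf h^*=\mathrm{argmin}_{\alpha, \mathbf b, \beta, \mathbf h}||\mathbf x\odot\mathbf w-\beta\alpha\,\mathbf h\odot\mathbf b||$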

Let $\mathbf y=\mathbf x\odot \mathbf w$, $\mathbf c=\mathbf h\odot \mathbf b$ with $\mathbf c\in \{-1, +1\}$, and $\gamma=\alpha\beta$, where $\odot$ is the element-wise product. The objective then simplifies to:
$\gamma^*, \mathbf c^*=\mathrm{argmin}_{\gamma, \mathbf c}||\mathbf y-\gamma\mathbf c||$

As in the Binary-Weight-Network case, the optimal binary activations and weights follow from the sign function:
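$\mathbf c^*=\mathrm{sign}(\mathbf y)=\mathrm{sign}(\mathbf x)\odot\mathrm{sign}(\mathbf w)=\mathbf h^*\odot\mathbf b^*$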

Similarly, the optimal scale factor follows from the $\ell_1$-norm:
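$\gamma^*=\frac{\sum |\mathbf y_i|}{n}=\frac{\sum |\mathbf x_i||\mathbf w_i|}{n}\approx\left(\frac{1}{n}||\mathbf x||_1\right)\left(\frac{1}{n}||\mathbf w||_1\right)=\beta^*\alpha^*$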
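A sketch of how the binary dot product can be evaluated with XNOR and bit counting, under the assumption that each sign vector is packed into an integer with +1 encoded as bit 1 and -1 as bit 0. The helper names and the example vectors are hypothetical; the scale factors are taken as the mean absolute values of the activations and weights.

```python
def pack_signs(values):
    """Encode sign(v) as bits: +1 -> 1, -1 -> 0 (first element = lowest bit)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def xnor_dot(h_bits, b_bits, n):
    """Dot product of two {-1,+1}^n vectors via XNOR and bit counting."""
    mask = (1 << n) - 1
    agree = bin(~(h_bits ^ b_bits) & mask).count("1")   # positions with equal signs
    return 2 * agree - n                                # agreements minus disagreements


x = [0.8, -0.3, 0.5, -0.9]           # hypothetical activations
w = [0.5, -1.0, 0.25, 0.25]          # hypothetical weights
n = len(x)
beta = sum(abs(v) for v in x) / n    # scale factor of the activations
alpha = sum(abs(v) for v in w) / n   # scale factor of the weights

binary_dot = xnor_dot(pack_signs(x), pack_signs(w), n)
approx = beta * alpha * binary_dot
exact = sum(xi * wi for xi, wi in zip(x, w))
print(binary_dot, approx, exact)
```

In this example the scaled binary result (0.625) lands close to the exact floating-point dot product (0.6), while the inner loop needs only bitwise operations and a population count.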

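Binary convolution
For an input $\mathbf i$, first compute $\mathbf a=\frac{\sum |\mathbf i_{:, :, i}|}{c}$, where $c$ is the number of input channels; this is the mean of the absolute values of the elements of $\mathbf i$ across channels. Then convolve $\mathbf a$ with a 2D filter $\mathbf k\in \mathbb R^{w\times h}$: $\mathbf K=\mathbf a * \mathbf k$, with $\mathbf k_{ij}=\frac{1}{wh}$. $\mathbf K$ contains the scale factors $\beta$ for all sub-tensors of $\mathbf i$.
The convolution is then approximated as: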
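$\mathbf i*\mathbf w\approx(\mathrm{sign}(\mathbf i)\circledast\mathrm{sign}(\mathbf w))\odot\mathbf K\alpha$ , where $\circledast$ denotes a convolution computed with XNOR and bit-counting operations.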