Layer Normalization



Background

  1. Problems with Batch Normalization: the effect of batch normalization is dependent on the mini-batch size, and it is not obvious how to apply it to recurrent neural networks.
  2. Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer. This suggests the “covariate shift” problem can be reduced by fixing the mean and the variance of the summed inputs within each layer.

Layer Normalization

We compute the layer normalization statistics over all the hidden units in the same layer:

$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_i^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^{l} - \mu^{l}\right)^{2}}$$

where $H$ denotes the number of hidden units in the layer. Unlike Batch Normalization, all units in a layer share the same normalization terms $\mu^{l}$ and $\sigma^{l}$, while different training cases are normalized with different $\mu$ and $\sigma$. As a result, layer normalization imposes no constraint on the mini-batch size and can be used in the pure online regime (mini-batch size = 1).
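As a quick illustration (a minimal NumPy sketch of my own, not code from the paper), the statistics are computed along the feature axis of each sample, in contrast to batch normalization, which computes them along the batch axis:

import numpy as np

# Toy activations: N = 4 samples, H = 5 hidden units in the layer
a = np.random.randn(4, 5)

# Layer normalization: one (mu, sigma) pair per sample,
# computed over the hidden units of that sample
mu_layer = a.mean(axis=1, keepdims=True)    # shape (4, 1)
sigma_layer = a.std(axis=1, keepdims=True)  # shape (4, 1)

# Batch normalization, for contrast: one (mu, sigma) pair per hidden unit,
# computed over the mini-batch
mu_batch = a.mean(axis=0, keepdims=True)    # shape (1, 5)
sigma_batch = a.std(axis=0, keepdims=True)  # shape (1, 5)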

Implementation from scratch

  • Forward pass:
    • Input:
      • x: data of shape (N, D)
      • gamma: scale parameter of shape (D,)
      • beta: shift parameter of shape (D,)
    • Step one:
      • Compute layer_mean: the mean of each sample, of shape (N, 1)
      • Compute layer_var: the variance of each sample, of shape (N, 1)
    • Step two:
      • Compute hat_x = (x - layer_mean) / sqrt(layer_var + eps).
    • Step three:
      • Compute out = gamma * hat_x + beta.
      • Note: I was puzzled here about why gamma is a (D,)-dimensional vector rather than an (N,)-dimensional one. Intuitively, by analogy with Batch Normalization, gamma and beta should be able to make Layer Normalization an identity transform, and since the statistics are per-sample, an (N,)-dimensional gamma would seem more natural.
import numpy as np


def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    ###########################################################################
    # TODO: Implement the training-time forward pass for layer norm.          #
    # Normalize the incoming data, and scale and  shift the normalized data   #
    #  using gamma and beta.                                                  #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of  batch normalization, and inserting a line or two of  #
    # well-placed code. In particular, can you think of any matrix            #
    # transformations you could perform, that would enable you to copy over   #
    # the batch norm code and leave it almost unchanged?                      #
    ###########################################################################

    # Per-sample statistics, computed over the feature dimension
    layer_mean = np.mean(x, axis=1, keepdims=True)   # shape (N, 1)
    layer_var = np.var(x, axis=1, keepdims=True)     # shape (N, 1)
    # Normalize each sample, then apply the learned scale and shift
    hat_x = (x - layer_mean) / np.sqrt(layer_var + eps)
    out = gamma * hat_x + beta
    cache = (x, gamma, beta, hat_x, layer_mean, layer_var, eps)

    return out, cache
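A quick sanity check of the forward pass (a usage sketch I added; the numbers are arbitrary): with gamma = 1 and beta = 0, every row of the output should have mean close to 0 and variance close to 1.

import numpy as np

N, D = 4, 6
x = 10 * np.random.randn(N, D) + 3     # arbitrary test data
gamma, beta = np.ones(D), np.zeros(D)  # identity scale and shift

out, _ = layernorm_forward(x, gamma, beta, {'eps': 1e-5})
print(out.mean(axis=1))  # each entry close to 0
print(out.var(axis=1))   # each entry close to 1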
  • Backward pass:

    This is entirely analogous to Batch Normalization; the gradient formulas and the code are given below:
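Written in the notation of the forward pass (my reconstruction, matching the code below), the gradients for one sample with $H = D$ features are, with $y = \gamma \odot \hat{x} + \beta$:

$$
\begin{aligned}
\frac{\partial L}{\partial \hat{x}_j} &= \frac{\partial L}{\partial y_j}\,\gamma_j \\
\frac{\partial L}{\partial \sigma^2} &= \sum_{j=1}^{H} \frac{\partial L}{\partial \hat{x}_j}\,(x_j - \mu)\cdot\left(-\tfrac{1}{2}\right)\left(\sigma^2 + \epsilon\right)^{-3/2} \\
\frac{\partial L}{\partial \mu} &= -\sum_{j=1}^{H} \frac{\partial L}{\partial \hat{x}_j}\,\frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{-2}{H}\sum_{j=1}^{H}\left(x_j - \mu\right) \\
\frac{\partial L}{\partial x_j} &= \frac{\partial L}{\partial \hat{x}_j}\,\frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{2\left(x_j - \mu\right)}{H} + \frac{\partial L}{\partial \mu}\cdot\frac{1}{H} \\
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\odot \hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}
\end{aligned}
$$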

def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################

    x, gamma, beta, hat_x, layer_mean, layer_var, eps = cache
    num_sample, num_depth = x.shape

    # Backpropagate through out = gamma * hat_x + beta
    dhat_x = dout * gamma

    # Gradients of the per-sample variance and mean (sums over the feature axis)
    dvar = np.sum(dhat_x * (x - layer_mean), axis=1, keepdims=True) * (-1/2) * (layer_var + eps) ** (-3/2)
    dmean = -np.sum(dhat_x, axis=1, keepdims=True) / np.sqrt(layer_var + eps) + \
            dvar * np.sum(x - layer_mean, axis=1, keepdims=True) * (-2/num_depth)

    # Sum the three paths through which x influences the loss
    dx = dhat_x / np.sqrt(layer_var + eps) + dvar * (2/num_depth) * (x - layer_mean) + dmean / num_depth

    dgamma = np.sum(dout * hat_x, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dx, dgamma, dbeta
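To convince myself the backward pass is correct, a numerical gradient check like the following can be used (a self-contained sketch; num_grad is a centered-difference helper I wrote for this post, not part of the assignment code):

import numpy as np

def num_grad(f, x, df, h=1e-5):
    """Centered-difference gradient of f at x, contracted with the upstream gradient df."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        pos = f(x).copy()
        x[idx] = old - h
        neg = f(x).copy()
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
N, D = 3, 4
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = layernorm_forward(x, gamma, beta, {'eps': 1e-5})
dx, dgamma, dbeta = layernorm_backward(dout, cache)

f = lambda x: layernorm_forward(x, gamma, beta, {'eps': 1e-5})[0]
print(np.max(np.abs(dx - num_grad(f, x, dout))))  # should be on the order of 1e-8 or smaller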

Author: Jason Yuan
License: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Jason Yuan when republishing!