assignment1-Q4 Backpropagation Derivation


Backpropagation itself is simple, but backpropagation through matrices is slightly more involved. There is nothing special about differentiating with respect to a matrix: if you expand every entry, everything becomes clear. Below I derive the backpropagation for the two-layer neural network in assignment1 Q4.

Suppose the input matrix \(X\) is NxD, where N is the number of data points and D is the dimension of each point. \(W_1\) is DxH, the bias \(b_1\) is H-dimensional, \(W_2\) is HxC, where C is the number of output classes, and the bias \(b_2\) is C-dimensional.

Let \(Z = X W_1 + b_1\). After applying the activation function (ReLU), the output of the first layer is \(A = \max(0, Z)\).

Note that, through broadcasting, \(b_1\) enters the sum as an NxH matrix whose every row is a copy of \(b_1\), as shown below:

$$\begin{bmatrix} X_{11} & X_{12} & … & X_{1D} \\ X_{21} & X_{22} & … & X_{2D} \\ … & … & … & …\\ X_{N1} & X_{N2} & … & X_{ND}\\ \end{bmatrix} \begin{bmatrix} W^{1}_{11} & W^{1}_{12} & … & W^{1}_{1H} \\ W^{1}_{21} & W^{1}_{22} & … & W^{1}_{2H} \\ … & … & … & …\\ W^{1}_{D1} & W^1_{D2} & … & W^1_{DH}\\ \end{bmatrix} +
\begin{bmatrix} b^1_{1} & b^1_{2} & … & b^1_{H} \\ b^1_{1} & b^1_{2} & … & b^1_{H} \\ … & … & … & …\\ b^1_{1} & b^1_{2} & … & b^1_{H}\\ \end{bmatrix} = \begin{bmatrix} Z_{11} & Z_{12} & … & Z_{1H} \\ Z_{21} & Z_{22} & … & Z_{2H} \\ … & … & … & …\\ Z_{N1} & Z_{N2} & … & Z_{NH}\\ \end{bmatrix} $$
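In NumPy this row-wise replication of \(b_1\) is handled automatically by broadcasting; a minimal sketch to confirm it (the sizes and the random seed here are arbitrary assumptions, not from the assignment):

```python
import numpy as np

N, D, H = 4, 5, 3                 # example sizes (arbitrary)
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
W1 = rng.standard_normal((D, H))
b1 = rng.standard_normal(H)

Z = X @ W1 + b1                            # b1 of shape (H,) broadcasts over the N rows
Z_explicit = X @ W1 + np.tile(b1, (N, 1))  # explicit N x H replication, as in the matrix above
print(np.allclose(Z, Z_explicit))          # True
```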

$$A = \begin{bmatrix} A_{11} & A_{12} & … & A_{1H} \\ A_{21} & A_{22} & … & A_{2H} \\ … & … & … & …\\ A_{N1} & A_{N2} & … & A_{NH}\\ \end{bmatrix} $$

Let \(Y = A W_2 + b_2\). This \(Y\) is the matrix of output scores.

$$ \begin{bmatrix} A_{11} & A_{12} & … & A_{1H} \\ A_{21} & A_{22} & … & A_{2H} \\ … & … & … & …\\ A_{N1} & A_{N2} & … & A_{NH}\\ \end{bmatrix} \begin{bmatrix} W^{2}_{11} & W^{2}_{12} & … & W^{2}_{1C} \\ W^{2}_{21} & W^{2}_{22} & … & W^{2}_{2C} \\ … & … & … & …\\ W^{2}_{H1} & W^2_{H2} & … & W^2_{HC}\\ \end{bmatrix} +
\begin{bmatrix} b^2_{1} & b^2_{2} & … & b^2_{C} \\ b^2_{1} & b^2_{2} & … & b^2_{C} \\ … & … & … & …\\ b^2_{1} & b^2_{2} & … & b^2_{C}\\ \end{bmatrix} = \begin{bmatrix} Y_{11} & Y_{12} & … & Y_{1C} \\ Y_{21} & Y_{22} & … & Y_{2C} \\ … & … & … & …\\ Y_{N1} & Y_{N2} & … & Y_{NC}\\ \end{bmatrix} $$
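The full forward pass above can be sketched in NumPy (the helper name `forward`, the sizes, and the seed are my own assumptions):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer net forward pass: returns scores Y plus the ReLU output A and input Z."""
    Z = X @ W1 + b1        # (N, H), b1 broadcasts over rows
    A = np.maximum(0, Z)   # ReLU
    Y = A @ W2 + b2        # (N, C) scores
    return Y, A, Z

rng = np.random.default_rng(1)
N, D, H, C = 4, 5, 3, 2
X = rng.standard_normal((N, D))
W1, b1 = rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = rng.standard_normal((H, C)), np.zeros(C)
Y, A, Z = forward(X, W1, b1, W2, b2)
print(Y.shape)  # (4, 2)
```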

After computing \(Y\), a softmax layer converts the scores into probabilities, and the cross-entropy of those probabilities gives the loss \(L\):

$$L = L_1 + L_2 + … + L_N $$

For the i-th data point, its loss \(L_i\) is contributed only by \(Y_{i1}, Y_{i2}, …, Y_{iC}\).

Define $$ \frac{\partial L}{\partial Y} = \begin{bmatrix} \frac{\partial L}{\partial Y_{11}} & \frac{\partial L}{\partial Y_{12}} & … & \frac{\partial L}{\partial Y_{1C}} \\ \frac{\partial L}{\partial Y_{21}} & \frac{\partial L}{\partial Y_{22}} & … & \frac{\partial L}{\partial Y_{2C}} \\ … & … & … & …\\ \frac{\partial L}{\partial Y_{N1}} & \frac{\partial L}{\partial Y_{N2}} & … & \frac{\partial L}{\partial Y_{NC}}\\ \end{bmatrix} = \begin{bmatrix} \frac{\partial L_1}{\partial Y_{11}} & \frac{\partial L_1}{\partial Y_{12}} & … & \frac{\partial L_1}{\partial Y_{1C}} \\ \frac{\partial L_2}{\partial Y_{21}} & \frac{\partial L_2}{\partial Y_{22}} & … & \frac{\partial L_2}{\partial Y_{2C}} \\ … & … & … & …\\ \frac{\partial L_N}{\partial Y_{N1}} & \frac{\partial L_N}{\partial Y_{N2}} & … & \frac{\partial L_N}{\partial Y_{NC}}\\ \end{bmatrix} $$

where $$L_i = -\log\left(\frac{e^{Y_{i y_i}}}{\sum_j e^{Y_{ij}}}\right)$$

$$\frac{\partial L_i}{\partial Y_{ik}}=\frac{e^{Y_{ik}}}{\sum_j e^{Y_{ij}}}-1(y_i=k)$$
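The probabilities and the gradient \(\partial L / \partial Y\) from the two formulas above can be computed together. A sketch (the function name `softmax_loss` and the max-subtraction for numerical stability are my additions, not part of the derivation):

```python
import numpy as np

def softmax_loss(Y, y):
    """Cross-entropy loss L = sum_i L_i and dL/dY, for scores Y (N, C) and labels y (N,)."""
    N = Y.shape[0]
    shifted = Y - Y.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    P = exp / exp.sum(axis=1, keepdims=True)    # softmax probabilities, row i sums to 1
    loss = -np.log(P[np.arange(N), y]).sum()    # L = L_1 + ... + L_N
    dY = P.copy()
    dY[np.arange(N), y] -= 1                    # dL_i/dY_ik = P_ik - 1(y_i = k)
    return loss, dY

Y = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])
loss, dY = softmax_loss(Y, y)
```

Each row of `dY` sums to zero, since the probabilities in a row sum to 1 and exactly one indicator fires.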

Now propagate the gradient backward. For example, for \(W^2_{11}\):
$$\frac{\partial L}{\partial W^2_{11}}=\frac{\partial L}{\partial Y_{11}} \frac{\partial Y_{11}}{\partial W^2_{11}}+\frac{\partial L}{\partial Y_{21}} \frac{\partial Y_{21}}{\partial W^2_{11}}+…+\frac{\partial L}{\partial Y_{N1}} \frac{\partial Y_{N1}}{\partial W^2_{11}}= \frac{\partial L_1}{\partial Y_{11}} \frac{\partial Y_{11}}{\partial W^2_{11}}+\frac{\partial L_2}{\partial Y_{21}} \frac{\partial Y_{21}}{\partial W^2_{11}}+…+\frac{\partial L_N}{\partial Y_{N1}} \frac{\partial Y_{N1}}{\partial W^2_{11}}$$

Substituting \(\frac{\partial Y_{i1}}{\partial W^2_{11}} = A_{i1}\), this sum is exactly the \((1,1)\) entry of \(A^T \frac{\partial L}{\partial Y}\), and in general:
$$\frac{\partial L}{\partial W_2} = A^T \frac{\partial L}{\partial Y}$$

Similarly:
$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Y} W_2^T$$

For \(b^2_1\) (note \(\frac{\partial Y_{i1}}{\partial b^2_1} = 1\)):
$$\frac{\partial L}{\partial b^2_1}=\frac{\partial L_1}{\partial Y_{11}}+… + \frac{\partial L_N}{\partial Y_{N1}}$$

Therefore:
$$\frac{\partial L}{\partial b_2} = \begin{bmatrix} \frac{\partial L}{\partial b^2_1} \\ \frac{\partial L}{\partial b^2_2} \\ … \\ \frac{\partial L}{\partial b^2_C}\\ \end{bmatrix} = \begin{bmatrix} \frac{\partial L_1}{\partial Y_{11}}+… + \frac{\partial L_N}{\partial Y_{N1}} \\ \frac{\partial L_1}{\partial Y_{12}}+… + \frac{\partial L_N}{\partial Y_{N2}} \\ … \\ \frac{\partial L_1}{\partial Y_{1C}}+… + \frac{\partial L_N}{\partial Y_{NC}}\\ \end{bmatrix} $$
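Given \(\partial L / \partial Y\), the three layer-2 gradients derived above are one line each in NumPy. A sketch with random placeholder values standing in for a real forward pass (all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, H, C = 4, 3, 2
A = np.maximum(0, rng.standard_normal((N, H)))  # placeholder first-layer output
W2 = rng.standard_normal((H, C))
dY = rng.standard_normal((N, C))                # stands in for dL/dY from the softmax layer

dW2 = A.T @ dY        # (H, C): dL/dW2 = A^T dL/dY
db2 = dY.sum(axis=0)  # (C,):  column sums of dL/dY
dA = dY @ W2.T        # (N, H): dL/dA = dL/dY W2^T
```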

Propagating further back, we reach the ReLU layer. For example, for \(Z_{11}\):
$$\frac{\partial L}{\partial Z_{11}}=\frac{\partial L}{\partial A_{11}} \cdot 1(Z_{11} > 0)$$

Therefore:
$$\frac{\partial L}{\partial Z}=\frac{\partial L}{\partial A} \odot 1(Z > 0)$$
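The elementwise mask \(\odot\, 1(Z > 0)\) is exactly `dA * (Z > 0)` in NumPy; a tiny worked example (values chosen by hand):

```python
import numpy as np

Z = np.array([[ 1.0, -2.0],
              [-0.5,  3.0]])
dA = np.array([[10.0, 20.0],
               [30.0, 40.0]])
dZ = dA * (Z > 0)  # gradient passes through only where the ReLU was active
print(dZ)          # [[10.  0.]
                   #  [ 0. 40.]]
```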

And similarly:
$$\frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial Z}$$

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W_1^T$$

$$\frac{\partial L}{\partial b_1} = \begin{bmatrix} \frac{\partial L}{\partial b^1_1} \\ \frac{\partial L}{\partial b^1_2} \\ … \\ \frac{\partial L}{\partial b^1_H}\\ \end{bmatrix} = \begin{bmatrix} \frac{\partial L_1}{\partial Z_{11}}+… + \frac{\partial L_N}{\partial Z_{N1}} \\ \frac{\partial L_1}{\partial Z_{12}}+… + \frac{\partial L_N}{\partial Z_{N2}} \\ … \\ \frac{\partial L_1}{\partial Z_{1H}}+… + \frac{\partial L_N}{\partial Z_{NH}}\\ \end{bmatrix} $$
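Putting every formula together, the full backward pass can be verified against a centered finite difference. A sketch (all names, sizes, and the seed are assumptions; the check perturbs a single entry of \(W_1\), and the `0.1` weight scale keeps the ReLU away from its kink):

```python
import numpy as np

def loss_and_grads(X, y, W1, b1, W2, b2):
    """Forward + backward pass mirroring every formula in the derivation."""
    Z = X @ W1 + b1
    A = np.maximum(0, Z)
    Y = A @ W2 + b2
    # softmax cross-entropy, L = sum_i L_i
    shifted = Y - Y.max(axis=1, keepdims=True)
    P = np.exp(shifted)
    P /= P.sum(axis=1, keepdims=True)
    N = X.shape[0]
    loss = -np.log(P[np.arange(N), y]).sum()
    # backward, in the order derived above
    dY = P.copy()
    dY[np.arange(N), y] -= 1           # dL/dY
    dW2, db2 = A.T @ dY, dY.sum(axis=0)
    dZ = (dY @ W2.T) * (Z > 0)         # dL/dA through the ReLU mask
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)
    return loss, dW1, db1, dW2, db2

rng = np.random.default_rng(3)
N, D, H, C = 5, 4, 6, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, N)
W1, b1 = 0.1 * rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((H, C)), np.zeros(C)
loss, dW1, db1, dW2, db2 = loss_and_grads(X, y, W1, b1, W2, b2)

# centered difference on W1[0, 0] should match the analytic dW1[0, 0]
eps = 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
num = (loss_and_grads(X, y, W1p, b1, W2, b2)[0]
       - loss_and_grads(X, y, W1m, b1, W2, b2)[0]) / (2 * eps)
print(num, dW1[0, 0])
```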

This completes the backpropagation. With more layers the procedure is the same: just propagate the gradient backward layer by layer.


Author: lovelyfrog
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit lovelyfrog when reposting.