A basis of a vector space:
a linearly independent set of vectors that spans the space.
A linear transformation is a map from one vector space to another
that preserves vector addition and scalar multiplication:
$$
\begin{equation}
L(\alpha\vec{v}+\beta\vec{w})=\alpha{L(\vec{v})}+\beta{L(\vec{w})}
\end{equation}
$$
Matrix representation:
$$
\begin{split}
\vec{v}&=x\vec{i}+y\vec{j} \\
&=\begin{bmatrix}x \\ y\end{bmatrix} \\
A\vec{v}&=x\hat{i}+y\hat{j} \\
&=x\begin{bmatrix}a \\ c\end{bmatrix}\vec{i}
+y\begin{bmatrix}b \\ d\end{bmatrix}\vec{j} \\
&=\begin{bmatrix}a & b \\ c & d\end{bmatrix}
\begin{bmatrix}x \\ y\end{bmatrix} \\
A_2A_1&=A_2\hat{i}+A_2\hat{j} \\
&=(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix}
\begin{bmatrix}a_1 \\ c_1 \end{bmatrix})\vec{i}
+(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix}
\begin{bmatrix}b_1 \\ d_1\end{bmatrix})\vec{j} \\
&=(a_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix}
+c_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{i}
+(b_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix}
+d_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{j} \\
&=\begin{bmatrix}
a_2a_1+b_2c_1 & a_2b_1+b_2d_1 \\
c_2a_1+d_2c_1 & c_2b_1+d_2d_1
\end{bmatrix}
\end{split}
$$
Left-multiplying by a matrix applies a linear transformation to column vectors;
right-multiplying applies a linear transformation to row vectors (a NumPy sketch follows the dimension notes below).
$A_{m\times n}$ represents a linear transformation from an n-dimensional space to an m-dimensional space:
n columns: the input space has n basis vectors, i.e. it is n-dimensional.
m rows: each input basis vector maps to m output coordinates, i.e. the output space is m-dimensional.
$A_{1\times n}$ represents a linear transformation from an n-dimensional space to a one-dimensional space:
the dot product $\vec{v} \cdot \vec{w}$ can be understood as
the projection of $\vec{w}$ onto the one-dimensional space via the transformation $V_{1\times n}$.
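A minimal NumPy sketch (illustrative, not from the original notes) checking the column picture above: applying $A$ to $\vec{v}$ uses the columns of $A$ as the images of the basis vectors, and composing two transformations corresponds to matrix multiplication:

```python
import numpy as np

# Columns of A are the images of the basis vectors i-hat and j-hat.
A1 = np.array([[1.0, -1.0],
               [2.0,  0.5]])
A2 = np.array([[0.0, 1.0],
               [1.0, 0.0]])
v = np.array([3.0, 2.0])

# A1 @ v equals x * (first column) + y * (second column).
assert np.allclose(A1 @ v, v[0] * A1[:, 0] + v[1] * A1[:, 1])

# Composition: applying A1 then A2 equals multiplying by A2 @ A1.
assert np.allclose(A2 @ (A1 @ v), (A2 @ A1) @ v)
```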
Dot Product and Cross Product
Dot Product: $\vec{v} \cdot \vec{w}=\|\vec{v}\|\|\vec{w}\|\cos{\theta}$.
Cross Product: $\|\vec{v} \times \vec{w}\|=\|\vec{v}\|\|\vec{w}\|\sin{\theta}$.
$\vec{v}\cdot(\vec{v}\times\vec{w})=0$,
$\vec{w}\cdot(\vec{v}\times\vec{w})=0$.
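A quick NumPy check of these identities (a sketch; the vectors are chosen arbitrarily):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([-2.0, 0.5, 1.0])

norm = np.linalg.norm
cos_theta = np.dot(v, w) / (norm(v) * norm(w))
sin_theta = norm(np.cross(v, w)) / (norm(v) * norm(w))
assert np.isclose(cos_theta**2 + sin_theta**2, 1.0)

# The cross product is orthogonal to both of its operands.
assert np.isclose(np.dot(v, np.cross(v, w)), 0.0)
assert np.isclose(np.dot(w, np.cross(v, w)), 0.0)
```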
Basis changes, translating transformations:
$$
\vec{v_p}=P^{-1}AP\vec{w_p}
$$
$\det(A)$ is the determinant of the matrix $A$,
the factor by which the transformation scales space:
When $\det(A)=0$, the transformation squashes space into a lower-dimensional space,
and $A$ is called a singular matrix:
The column vectors of $A$ are linearly dependent.
$A$ is not full rank.
$A$ is not invertible.
Determinant for 2d matrix:
$$
\begin{equation}
\begin{vmatrix}a & b \\ c & d\end{vmatrix}=ad-bc
\end{equation}
$$
Determinant for 3d matrix:
$$
\begin{equation}
\begin{vmatrix}a & b & c \\ d & e & f \\ g & h & i\end{vmatrix}
=a\begin{vmatrix}e & f \\ h & i\end{vmatrix}
-b\begin{vmatrix}d & f \\ g & i\end{vmatrix}
+c\begin{vmatrix}d & e \\ g & h\end{vmatrix}
\end{equation}
$$
Determinant for matrix multiplication:
$$
\begin{equation}
\det(A_1A_2)=\det(A_1)\det(A_2)
\end{equation}
$$
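A NumPy sketch (illustrative) checking the 2D formula and the product rule for determinants:

```python
import numpy as np

A1 = np.array([[3.0, 1.0],
               [4.0, 2.0]])
A2 = np.array([[0.0, -1.0],
               [2.0,  5.0]])

# ad - bc for a 2x2 matrix.
assert np.isclose(np.linalg.det(A1), 3.0 * 2.0 - 1.0 * 4.0)

# det(A1 A2) = det(A1) det(A2).
assert np.isclose(np.linalg.det(A1 @ A2),
                  np.linalg.det(A1) * np.linalg.det(A2))
```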
Solving a linear system of equations with Gaussian elimination:
First normalize the first row so that its first element becomes 1,
then subtract from each row below a multiple of the first row (scaled by that row's first element),
so that every element of the first column except the first row becomes 0;
then zero out the second column below its first two elements,
and so on until the matrix becomes upper triangular.
Similarly, starting from the last row, reduce the upper triangular matrix row by row to the identity matrix.
$$
\begin{split}
A\vec{x}&=\vec{v} \\
A^{-1}A\vec{x}&=A^{-1}\vec{v} \\
\vec{x}&=A^{-1}\vec{v}
\end{split}
$$
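In practice the system $A\vec{x}=\vec{v}$ is solved by (a variant of) Gaussian elimination rather than by forming $A^{-1}$ explicitly; a minimal NumPy sketch with arbitrary data:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([3.0, 5.0])

x = np.linalg.solve(A, v)                    # elimination-based solver
assert np.allclose(A @ x, v)
assert np.allclose(x, np.linalg.inv(A) @ v)  # same result as x = A^{-1} v
```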
$$
A=\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
$$
Eigenvalue ($A\vec{v}=\lambda\vec{v}$) quick calculation:
$$
\begin{split}
\lambda&=m\pm\sqrt{m^2-p} \\
&=\frac{\lambda_1+\lambda_2}{2}
\pm\sqrt{(\frac{\lambda_1+\lambda_2}{2})^2-\lambda_1\lambda_2} \\
&=\frac{a+d}{2}\pm\sqrt{(\frac{a+d}{2})^2-(ad-bc)}
\end{split}
$$
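A NumPy sketch verifying the shortcut ($m$ is the mean of the eigenvalues, i.e. half the trace; $p$ is their product, i.e. the determinant); the matrix is an arbitrary example:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 2.0]])
a, b, c, d = A.ravel()

m = (a + d) / 2       # mean of the eigenvalues (trace / 2)
p = a * d - b * c     # product of the eigenvalues (determinant)
quick = np.array([m + np.sqrt(m**2 - p), m - np.sqrt(m**2 - p)])

assert np.allclose(np.sort(quick), np.sort(np.linalg.eigvals(A).real))
```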
L'Hôpital's rule is one effective way to evaluate indeterminate limits of the form $\lim_{x\to{a}}\frac{0}{0}$:
$$
\begin{equation}
\begin{split}
\lim_{x\to{a}}\frac{f(x)}{g(x)}
&=\lim_{x\to{a}}\frac{df(x)}{dg(x)} \\
&=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)dx}{\frac{dg}{dx}(a)dx} \\
&=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)}{\frac{dg}{dx}(a)} \\
&=\lim_{x\to{a}}\frac{f'(a)}{g'(a)}
\end{split}
\end{equation}
$$
Common derivatives:
$$
\begin{equation}
\begin{split}
\frac{d}{dx}x^n&=nx^{n-1} \\
\frac{d}{dx}\sin{x}&=\cos{x} \\
\frac{d}{dx}\cos{x}&=-\sin{x} \\
\frac{d}{dx}a^x&=a^x\ln{a} \\
\frac{d}{dx}e^x&=e^x \\
\frac{d}{dx}\log_a{x}&=\frac{1}{x\ln{a}} \\
\frac{d}{dx}\ln{x}&=\frac{1}{x} \\
\frac{d}{dx}(g(x)+h(x))&=g'(x)+h'(x) \\
\frac{d}{dx}(g(x)h(x))&=g'(x)h(x)+g(x)h'(x) \\
\frac{d}{dx}f(g(x))&=f'(g(x))g'(x) \\
\frac{d}{dx}f^{-1}(x)&=\frac{1}{f'(f^{-1}(x))} \\
\frac{d}{dx}\int_{a(x)}^{b(x)}f(t)dt&=f(b(x))b'(x)-f(a(x))a'(x)
\end{split}
\end{equation}
$$
A Taylor series uses the derivatives of a function at a point to approximate the function's values near that point:
$$
\begin{equation}
\begin{split}
\frac{1}{1-x}&=\sum\limits_{n=0}^{\infty}x^n \quad |x|\lt1 \\
e^x&=\sum\limits_{n=0}^{\infty}\frac{x^n}{n!} \\
\ln(1+x)&=\sum\limits_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}x^n \quad x\in(-1,1] \\
\sin(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n+1)!}x^{2n+1} \\
\cos(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n} \\
f(x)&=\sum\limits_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^n \\
&=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+\dots
\end{split}
\end{equation}
$$
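A small Python sketch (illustrative only) approximating $\sin(x)$ with the first few Taylor terms around 0:

```python
import math

def taylor_sin(x, terms=6):
    """Partial sum of the Taylor series of sin(x) around 0."""
    return sum((-1)**n / math.factorial(2 * n + 1) * x**(2 * n + 1)
               for n in range(terms))

x = 1.2
print(taylor_sin(x), math.sin(x))   # the two values agree to ~1e-9
assert abs(taylor_sin(x) - math.sin(x)) < 1e-8
```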
Circular motion on the complex plane:
$$
\begin{equation}
e^{ix}=\cos{x}+i\sin{x}
\end{equation}
$$
Time to frequency transform:
$$
\begin{equation}
\hat{f}(\xi)=\int_{-\infty}^{\infty}f(t)e^{-2\pi i\xi t}dt
\end{equation}
$$
Discrete Fourier Transform (DFT):
$$
\begin{equation}
X[k]=\sum\limits_{n=0}^{N-1}x_n e^{-\frac{i2\pi}{N}kn}
\end{equation}
$$
The transform can be written as multiplication by the matrix of $n$-th roots of unity:
$$
\begin{bmatrix}
1 & 1 & 1 & \dots & 1 \\
1 & e^{\frac{2\pi i}{n}} & e^{\frac{2\pi i(2)}{n}}
& \dots & e^{\frac{2\pi i(n-1)}{n}} \\
1 & e^{\frac{2\pi i(2)}{n}} & e^{\frac{2\pi i(4)}{n}}
& \dots & e^{\frac{2\pi i(2)(n-1)}{n}} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & e^{\frac{2\pi i(n-1)}{n}} & e^{\frac{2\pi i(2)(n-1)}{n}}
& \dots & e^{\frac{2\pi i(n-1)(n-1)}{n}}
\end{bmatrix}
$$
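A NumPy sketch building the DFT as the matrix-vector product defined above and comparing it with np.fft.fft (note the sign convention: the analysis matrix uses $e^{-2\pi ikn/N}$, the conjugate of the roots-of-unity matrix shown above):

```python
import numpy as np

N = 8
x = np.random.rand(N)

k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
W = np.exp(-2j * np.pi * k * n / N)   # DFT (analysis) matrix

assert np.allclose(W @ x, np.fft.fft(x))
```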
A differential equation describes relationships between variables;
it typically involves an unknown function and its derivatives and is used to model physical phenomena and natural laws.
First-order differential equations:
$$
\begin{equation}
\frac{d}{dt}\begin{bmatrix}x(t)\\y(t)\end{bmatrix}
=\begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x(t)\\y(t)\end{bmatrix}
\Rightarrow
\begin{bmatrix}x(t)\\y(t)\end{bmatrix}
=e^{\begin{bmatrix}a&b\\c&d\end{bmatrix}t}\begin{bmatrix}x(0)\\y(0)\end{bmatrix}
\end{equation}
$$
$$
\begin{split}
\text{if} \quad \vec{v}(t)&=e^{Mt}\vec{v}_0 \\
\text{then} \quad
\frac{d}{dt}\vec{v}(t)
&=\frac{d}{dt}e^{Mt}\vec{v}_0 \\
&=\frac{d}{dt}\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}t^n\vec{v}_0 \\
&=\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}nt^{n-1}\vec{v}_0 \\
&=M\sum\limits_{n=1}^{\infty}\frac{M^{n-1}}{(n-1)!}t^{n-1}\vec{v}_0 \\
&=Me^{Mt}\vec{v}_0 \\
&=M\vec{v}(t)
\end{split}
$$
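A NumPy sketch (assumptions: a small 2x2 system and a truncated power series for $e^{Mt}$) checking numerically that $\vec{v}(t)=e^{Mt}\vec{v}_0$ satisfies $\dot{\vec{v}}=M\vec{v}$:

```python
import numpy as np

def expm_series(M, t, terms=30):
    """Truncated power series for e^{Mt}."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ (M * t) / n
        result = result + term
    return result

M = np.array([[0.0, 1.0],
              [-1.0, -0.1]])   # a damped oscillator written as a first-order system
v0 = np.array([1.0, 0.0])

t, h = 1.0, 1e-5
v = expm_series(M, t) @ v0
dv_dt = (expm_series(M, t + h) @ v0 - expm_series(M, t - h) @ v0) / (2 * h)
assert np.allclose(dv_dt, M @ v, atol=1e-6)
```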
$$
\begin{equation}
\ddot{x}(t)=-\mu\dot{x}(t)-\omega x(t)
\end{equation}
$$
Gravitational force equation:
$$
\begin{split}
\ddot{y}(t)=-g, \quad &
\dot{y}(t)=-gt+v_0
\\
\frac{d\vec{x}_1}{dt}=\vec{v}_1, \quad &
\frac{d\vec{v}_1}{dt}=Gm_2\Big(\frac{\vec{x}_2-\vec{x}_1}{\|\vec{x}_2-\vec{x}_1\|}\Big)\Big(\frac{1}{\|\vec{x}_2-\vec{x}_1\|^2}\Big)
\\
& \ddot{\theta}(t)=-\mu\dot{\theta}(t)-\frac{g}{L}\sin\big({\theta}(t)\big)
\end{split}
$$
Heat equation:
$$
\frac{\partial{T}}{\partial{t}}(x,t)=\alpha\frac{\partial^2{T}}{\partial{x^2}}(x,t)
$$
Black-Scholes / Merton equation:
$$
\frac{\partial{V}}{\partial{t}}+rS\frac{\partial{V}}{\partial{S}}+\frac{1}{2}\sigma^2S^2\frac{\partial^2{V}}{\partial{S^2}}-rV=0
$$
Phase space is a space describing the states of a system:
each point represents one state, and the trajectory of a point describes the system's evolution.
```python
import numpy as np

g = 9.8         # gravitational acceleration
L = 2           # pendulum length
mu = 0.1        # damping coefficient
THETA_0 = np.pi / 3
THETA_DOT_0 = 0

def get_theta_double_dot(theta, theta_dot):
    return -mu * theta_dot - (g / L) * np.sin(theta)

def theta(t):
    """Integrate the damped pendulum ODE with Euler's method."""
    theta = THETA_0
    theta_dot = THETA_DOT_0
    delta_t = 0.01
    for _ in np.arange(0, t, delta_t):
        theta_double_dot = get_theta_double_dot(theta, theta_dot)
        theta += theta_dot * delta_t
        theta_dot += theta_double_dot * delta_t
    return theta
```
If a random variable $X$ follows a probability distribution with location parameter $\mu$ and scale parameter $\sigma$,
and its probability density function (PDF) is:
$$
\begin{equation}
f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}
\end{equation}
$$
then it is called a normal random variable, and the distribution it follows is called a normal distribution,
written $X \sim N(\mu,\sigma^2)$ and read as "$X$ follows $N(\mu,\sigma^2)$" (a normal distribution).
Here $\mu$ is the mean (mathematical expectation) and $\sigma$ is the standard deviation.
The normal distribution (also called the Gaussian distribution) is a continuous probability distribution.
When $\mu=0$ and $\sigma=1$, it is called the standard normal distribution.
In nature and in engineering, some phenomena are influenced by many mutually independent random factors;
when each factor's individual contribution is tiny, the total effect (the sum) can be treated as normally distributed.
The sum of independent normal distributions is again normally distributed.
The mean of the overall normal distribution equals the sum of the individual means:
$E(X_1+\dots+X_n)=E(X_1)+\dots+E(X_n)=n\mu$.
Assuming zero covariance, the variance of the overall distribution equals the sum of the individual variances:
${Var(X_1+\dots+X_n)}={Var(X_1)+\dots+Var(X_n)}={n}\sigma^2$,
so the standard deviation of the overall distribution is $\sqrt{n}\sigma$.
Let the random variables $X_1,X_2,\dots,X_n$ be independent and identically distributed (i.i.d.),
with mean $E(X_i)=\mu$ and variance $D(X_i)=\sigma^2$.
For any $x$, the distribution function
$$
F_n(x)=P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\}
$$
satisfies
$$
\begin{equation}
\lim_{n\to\infty}F_n(x)
=\lim_{n\to\infty}P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\}
=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^x{e^{-\frac{t^2}{2}}dt}
=\Phi(x)
\end{equation}
$$
The central limit theorem for i.i.d. variables states that when $n$ is large enough,
the random variable $X_n=\sum\limits_{i=1}^n{X_i}$
approximately follows the normal distribution $N(n\mu,n\sigma^2)$;
the standardized random variable $Y_n=\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}$
approximately follows the standard normal distribution $N(0,1)$.
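A NumPy sketch illustrating the CLT: sums of i.i.d. uniform variables, standardized as above, have sample mean close to 0 and sample standard deviation close to 1 (values are random, so the checks are loose):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10000

# Uniform(0, 1): mu = 0.5, sigma^2 = 1/12.
mu, sigma = 0.5, np.sqrt(1 / 12)
samples = rng.uniform(0, 1, size=(trials, n))

Y = (samples.sum(axis=1) - n * mu) / (np.sqrt(n) * sigma)
print(Y.mean(), Y.std())     # close to 0 and 1
assert abs(Y.mean()) < 0.05 and abs(Y.std() - 1) < 0.05
```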
For a more general form of the central limit theorem,
see the Lindeberg CLT,
etc.
$$
\begin{equation}
\int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi}
\end{equation}
$$
The Gaussian integral is solved by moving to a higher-dimensional space (squaring the integral and switching to polar coordinates).
For the normal distribution, the coefficient $\frac{1}{\sqrt{\pi}}$ makes the probability density function integrate to 1,
i.e. $\int_{-\infty}^{\infty}f(x)dx=1$, making it a valid probability distribution.
Repeating $n$ independent Bernoulli trials gives $X \sim B(n,p)$, with expectation $E(X)=np$ and variance $D(X)=np(1-p)$:
$$
\begin{equation}
P(X=k)=C_n^kp^k(1-p)^{n-k}
\end{equation}
$$
Bayes' theorem:
$P(A \cap B)=P(A|B)P(B)=P(B|A)P(A)\Rightarrow$
$$
\begin{equation}
P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\neg{A})P(\neg{A})}
\end{equation}
$$
Here $\frac{P(B|A)}{P(B|\neg{A})}$ is called the Bayes factor:
$$
O(A|B)=\frac{P(A|B)}{P(\neg{A}|B)}=\frac{P(A|B)P(B)}{P(\neg{A}|B)P(B)}=\frac{P(B|A)P(A)}{P(B|\neg{A})P(\neg{A})}=O(A)\frac{P(B|A)}{P(B|\neg{A})}
$$
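A small numeric sketch of Bayes' theorem and the Bayes factor (hypothetical numbers: a test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence):

```python
p_a = 0.01                 # P(A): prior probability of the condition
p_b_given_a = 0.99         # P(B|A): true positive rate
p_b_given_not_a = 0.05     # P(B|~A): false positive rate

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)         # ~0.167: a positive test is still far from certain

bayes_factor = p_b_given_a / p_b_given_not_a        # 19.8
posterior_odds = (p_a / (1 - p_a)) * bayes_factor   # O(A|B) = O(A) * Bayes factor
assert abs(posterior_odds / (1 + posterior_odds) - p_a_given_b) < 1e-12
```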
Information entropy
is a measure of the expected amount of information ($E[I]$):
an event with small probability carries a lot of information when it occurs, while a high-probability event carries little;
i.e. small probability means a rare outcome, high uncertainty, high entropy, and a large amount of information:
$$
\begin{equation}
H(X)=E[-\log_2{P(x_i)}]=-\sum\limits_{i=1}^n{P(x_i)\log_2{P(x_i)}}
\end{equation}
$$
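A short Python sketch computing the entropy of a discrete distribution (a fair coin gives 1 bit; a biased coin gives less):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.9, 0.1]))   # ~0.469 bits
assert math.isclose(entropy([0.25] * 4), 2.0)
```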
Output a scalar:
Linear regression:
$y=Wx+b=\sum\limits_{i=1}^n{w_ix_i}+b$,
$L=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2$.
Polynomial regression:
$y=\sum\limits_{i=1}^n{w_ix^i}+b$.
Logistic regression (output probability):
$y=\sigma(Wx+b)=\frac{1}{1+e^{-\sum\limits_{i=1}^n{w_ix_i}-b}}$,
$L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}$.
If the model can't even fit the training data,
it has large bias (underfitting).
If the model fits the training data but not the testing data,
it has large variance (overfitting).
To prevent underfitting, we can:
Add more features as input.
Use a more complex and flexible model.
A more complex model does not always lead to better performance
on testing data or new data.
| Model | Training Error | Testing Error |
| --- | --- | --- |
| $x$ | 31.9 | 35.0 |
| $x^2$ | 15.4 | 18.4 |
| $x^3$ | 15.3 | 18.1 |
| $x^4$ | 14.9 | 28.2 |
| $x^5$ | 12.8 | 232.1 |
An extreme example:
the following function achieves $0$ training loss but large testing loss:
$$
\begin{align*}
f(x)=\begin{cases}
y_i, & \exists{x_i}\in{X} \\
\text{random}, & \text{otherwise}
\end{cases}
\end{align*}
$$
To prevent overfitting, we can:
More training data.
Data augmentation: crop, flip, rotate, cutout, mixup.
Constrained model:
Less parameters, sharing parameters.
Less features.
Early stopping.
Dropout.
Regularization.
$$
\begin{split}
L(w)&=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n{w_i^2}\\
w_{t+1}&=w_t-\eta\nabla{L(w)}\\
&=w_t-\eta(\frac{\partial{L}}{\partial{w}}+\lambda{w_t})\\
&=(1-\eta\lambda)w_t-\eta\frac{\partial{L}}{\partial{w}}
\quad (\text{Regularization: Weight Decay})
\end{split}
$$
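A minimal NumPy sketch of the weight-decay update above on a toy least-squares problem (the data, learning rate, and $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta, lam = 0.01, 0.1
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the squared error
    w = (1 - eta * lam) * w - eta * grad    # weight-decay update
print(w)   # close to the true weights, shrunk slightly toward zero
```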
Binary classification:
$y=\delta(Wx+b)$,
$L=\sum\limits_{i=1}^n\delta(y_i\ne\hat{y}_i)$,
e.g. spam filtering.
Multi-class classification:
$y=\text{softmax}(Wx+b)$,
$L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}$,
e.g. document classification.
Non-linear model:
Deep learning: $y=\text{softmax}(\text{ReLU}(Wx+b))$,
e.g. image recognition, game playing.
Support vector machine (SVM): $y=\text{sign}(Wx+b)$.
Decision tree: $y=\text{vote}(\text{leaves}(x))$.
K-nearest neighbors (KNN): $y=\text{vote}(\text{neighbors}(x))$.
Find a function $F$:
$F:X\times{Y}\to{R}$
$F(x, y)$ evaluates how well $y$ fits $x$ (object compatibility).
Given an object $x$:
$\tilde{y}=\arg\max\limits_{y\in{Y}}F(x, y)$
Evaluation: what does $F(x, y)$ look like?
Inference: how to solve the $\arg\max$ problem.
Training: how to find $F(x, y)$ from the given training data.
$$
\begin{split}
F(x, y)&=\sum\limits_{i=1}^n{w_i\phi_i(x, y)} \\
&=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot
\begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix}\\
&=W\cdot\Phi(x, y)
\end{split}
$$
Principal component analysis (PCA) is a common dimensionality-reduction method that reduces $m$ vectors of dimension $n$ down to $k$ dimensions.
Its goal is to choose $k$ orthogonal unit basis vectors (norm $1$) such that, after the original data is expressed in this basis,
the pairwise covariances between fields are $0$ (the fields are fully independent) and the variance of each field is as large as possible (the reduced data is as spread out as possible);
i.e. under the orthogonality constraint, pick the $k$ largest variances:
$$
\begin{equation}
C=\frac{1}{m}XX^T=\frac{1}{m}\begin{bmatrix}
\sum_{i=1}^m(x_i^1)^2&\sum_{i=1}^m{x_i^1x_i^2}&\dots&\sum_{i=1}^m{x_i^1x_i^n}\\
\sum_{i=1}^m{x_i^2x_i^1}&\sum_{i=1}^m(x_i^2)^2&\dots&\sum_{i=1}^m{x_i^2x_i^n}\\
\vdots&\vdots&\ddots&\vdots\\
\sum_{i=1}^m{x_i^nx_i^1}&\sum_{i=1}^m{x_i^nx_i^2}&\dots&\sum_{i=1}^m(x_i^n)^2
\end{bmatrix}
\end{equation}
$$
The covariance matrix $C$ is symmetric; its diagonal entries are the variances of the fields,
and the entry in row $i$, column $j$ equals the entry in row $j$, column $i$: the covariance of fields $i$ and $j$.
Diagonalizing the covariance matrix gives the following matrix-based PCA algorithm (a NumPy sketch follows the steps):
Arrange the original data by columns into an $n$-row, $m$-column matrix $X$.
Zero-center each row of $X$ (each row is one attribute/field) by subtracting the row mean,
so that $\bar{x}=0$, which simplifies the variance and covariance computations.
Compute the eigenvalues and corresponding eigenvectors of the covariance matrix $C=\frac{1}{m}XX^T$.
Stack the eigenvectors as rows, ordered by decreasing eigenvalue, and take the first $k$ rows to form the matrix $P$.
$Y=PX$ is the data reduced to $k$ dimensions.
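A NumPy sketch of the steps above (rows are features, columns are samples; the top-$k$ eigenvectors of the covariance matrix form $P$):

```python
import numpy as np

def pca(X, k):
    """X has shape (n_features, m_samples); returns the k-dim projection Y = P X."""
    X = X - X.mean(axis=1, keepdims=True)   # zero-center each row (field)
    C = X @ X.T / X.shape[1]                # covariance matrix (1/m) X X^T
    eigvals, eigvecs = np.linalg.eigh(C)    # symmetric matrix, so use eigh
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    P = eigvecs[:, order[:k]].T             # top-k eigenvectors as rows
    return P @ X

X = np.random.rand(5, 200)   # 5 features, 200 samples
Y = pca(X, k=2)
print(Y.shape)               # (2, 200)
```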
$$
x'_i=\frac{x_i-\mu}{\sigma}
$$
Word embedding is a technique in natural language processing (NLP)
that maps words into a real-valued vector space, so that semantic relations between words can be represented by distances in that space.
Variational autoencoders (VAEs) are generative models that learn the latent distribution of the data in order to generate new data:
$$
\begin{split}
Z&=\text{Encoder}(X) \\
X'&=\text{Decoder}(Z) \\
L&=\text{Min Loss}(X',X)
\end{split}
$$
A variational autoencoder learns the probability distribution of the latent variables (features) $Z$: $z\sim N(0, I)$, $x|z\sim N\big(\mu(z), \sigma(z)\big)$.
A deep network learns the parameters of $q(z|x)$, and $q$ is optimized step by step until it closely approximates $p(z|x)$, which can then be used for approximate inference over complex distributions:
Feature disentanglement:
voice conversion.
Discrete representation:
unsupervised classification, unsupervised summarization.
Anomaly detection:
face detection, fraud detection, disease detection, network intrusion detection.
Compression and decompression.
Generator.
Generative adversarial networks (GANs) consist of two networks: a generator and a discriminator.
The generator aims to produce data that is as realistic as possible; the discriminator aims to distinguish real data from generated data as accurately as possible.
The two networks compete with each other: the generator produces data (like the decoder in a VAE), the discriminator judges whether the data is real or fake (a $1/0$ classification neural network),
and the generator adjusts its generation strategy based on the discriminator's feedback, steadily improving the realism of the generated data.
$$
\begin{split}
G^*&=\arg\min_G\max_DV(G,D)\\
D^*&=\arg\max_DV(D,G)
\end{split}
$$
Pre-trained models + fine-tuning (downstream tasks):
Cross lingual.
Cross discipline.
Pre-training with artificial data.
Long context window.
Content filtering: remove harmful content.
Text extraction: remove HTML tags.
Quality filtering: remove low-quality content.
Document deduplication: remove duplicate content.
BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only pre-trained model
that learns the semantics of text through large-scale unsupervised learning and is then fine-tuned for downstream tasks:
Masked token prediction: randomly mask some tokens in the input text and predict the masked tokens.
Next sentence prediction: predict the ordering relationship between two sentences.
GPT (Generative Pre-trained Transformer) is a decoder-only pre-trained model.
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique.
Its basic idea is to freeze the original matrix $W_0\in\mathbb{R}^{H\times{H}}$
and approximate the parameter update matrix $\Delta{W}=A\cdot{B^T}$
with the low-rank factor matrices $A\in\mathbb{R}^{H\times{R}}$ and $B\in\mathbb{R}^{H\times{R}}$,
where $R\ll{H}$ is the reduced rank:
$$
\begin{equation}
W=W_0+\Delta{W}=W_0+A\cdot{B^T}
\end{equation}
$$
During fine-tuning, the original matrix parameters $W_0$ are not updated;
the low-rank factor matrices $A$ and $B$ are the trainable parameters used to adapt to downstream tasks.
LoRA fine-tuning preserves model quality while significantly reducing the cost of training.
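A minimal NumPy sketch of the LoRA idea (shapes only; `H`, `R`, and the initialization are arbitrary illustrative choices, not the original configuration):

```python
import numpy as np

H, R = 64, 4
rng = np.random.default_rng(0)

W0 = rng.normal(size=(H, H))        # frozen pre-trained weight
A = rng.normal(size=(H, R)) * 0.01  # trainable low-rank factor
B = np.zeros((H, R))                # trainable, zero-init so delta_W starts at 0

def forward(x):
    # W x = W0 x + (A B^T) x; only A and B would receive gradient updates.
    return W0 @ x + A @ (B.T @ x)

x = rng.normal(size=H)
assert np.allclose(forward(x), (W0 + A @ B.T) @ x)
```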
Make the model understand human instructions that do not appear in the training data:
Increasing the complexity and diversity of instructions improves model performance.
A larger parameter scale helps improve the model's instruction-following ability.
Reinforcement learning is a machine learning method in which an agent interacts with an environment,
adjusts its policy according to the environment's feedback, and uses gradient ascent
to maximize the long-term reward (learn from rewards and mistakes).
$$
\begin{equation}
\begin{split}
\theta^*&=\arg\max\limits_\theta\bar{R}_\theta=\arg\max\limits_\theta\sum\limits_{\tau}R(\tau)P(\tau|\theta)\\
\theta_{t+1}&=\theta_t+\eta\nabla\bar{R}_\theta\\
\nabla\bar{R}_\theta&=\begin{bmatrix}\frac{\partial\bar{R}_\theta}{\partial{w_1}}\\\frac{\partial\bar{R}_\theta}{\partial{w_2}}\\\vdots\\\frac{\partial\bar{R}_\theta}{\partial{b_1}}\\\vdots\end{bmatrix}\\
R_t&=\sum\limits_{n=t}^N\gamma^{n-t}r_n
\end{split}
\end{equation}
$$
A multilayer perceptron is a feedforward neural network.
It resembles a simplified model of how a brain processes information:
through multiple layers (input layer, hidden layers, output layer)
it extracts features from raw data and makes predictions or classifications,
learning and improving its predictive ability by adjusting its internal connection weights.
Linear transformation $H^l=W^lX^{l-1}+B^l$:
$w_{ij}^l$ (weight): the weight of the connection between node $i$ of layer $l$ and node $j$ of the previous layer.
$b_i^l$ (bias): the bias of node $i$ of layer $l$.
$$
H=\begin{bmatrix}
w_{00}^l & w_{01}^l & \dots & w_{0n}^l \\
w_{10}^l & w_{11}^l & \dots & w_{1n}^l \\
\vdots & \vdots & \ddots & \vdots \\
w_{k0}^l & w_{k1}^l & \dots & w_{kn}^l
\end{bmatrix}
\begin{bmatrix}
x_0^{l-1} \\
x_1^{l-1} \\
\vdots \\
x_n^{l-1}
\end{bmatrix}
+\begin{bmatrix}
b_0^l \\
b_1^l \\
\vdots \\
b_k^l
\end{bmatrix}
$$
Activation function $y=\sigma(H)$, $X^l=\sigma(H^l)$:
Introduces non-linearity, so the network can learn and fit complex functions.
ReLU (Rectified Linear Unit): $\sigma(H)=\max(0,H)$,
which mitigates the vanishing-gradient problem (where gradients of neurons closer to the input layer approach 0) and speeds up convergence.
Sigmoid: $\sigma(H)=\frac{1}{1+e^{-H}}$,
e.g. as a normalizing function that keeps outputs between 0 and 1, which lets the whole network act as a probabilistic model.
Loss function $L(y,\hat{y})$:
Measures the difference between the true value (or human-annotated value) $y$ and the model prediction $\hat{y}$.
Common loss functions include mean squared error (MSE) and cross entropy.
Mean squared error:
$L(y,\hat{y})=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2$.
Cross entropy:
$L(y,\hat{y})=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}$, commonly used for classification tasks.
Selective synaptic plasticity:
$L'(\theta)=L(\theta)+\lambda\sum\limits_ib_i(\theta_i'-\theta_i)^2$.
$|\mathcal{H}|$ is the size of the hypothesis space;
a larger $|\mathcal{H}|$ means a more expressive model:
A deep model needs fewer neurons (parameters) to represent the same function,
which means a deep model has a smaller $|\mathcal{H}|$,
while a flat/shallow model has a larger $|\mathcal{H}|$.
Learning is the process of minimizing the loss function,
finally finding the right weights and biases:
Early stopping.
Dropout.
Regularization.
New activation function.
Adaptive learning rate.
Gradient descent
optimizes the loss function and minimizes it (adjusting $W$ and $B$ along the direction of steepest descent):
$$
\begin{equation}
\theta_{t+1}=\theta_t-\eta\nabla{L}
\end{equation}
$$
where $\eta$ is the learning rate, $L$ the loss function, $\nabla{L}$ the gradient of the loss,
$\theta$ the model parameters, and $t$ the iteration index.
$$
\begin{split}
\arg\min{L(\theta)}&=L(a,b)+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b)\\
&+\dots+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_1^n}}(\theta_1-a)^n+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_2^n}}(\theta_2-b)^n\\
&\approx{L(a,b)}+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b)
\\
&\Rightarrow
\begin{bmatrix}\theta_1-a\\\theta_2-b\end{bmatrix}
=-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix}
\\
&\Rightarrow
\begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix}
=\begin{bmatrix}a\\b\end{bmatrix}-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix}
\end{split}
$$
```python
def convex_function(x):
    return x ** 2

def gradient_descent(initial_x, learning_rate, num_iterations):
    x_values = [initial_x]
    y_values = [convex_function(initial_x)]
    x = initial_x
    for i in range(num_iterations):
        gradient = 2 * x                 # derivative of x^2
        x -= learning_rate * gradient
        x_values.append(x)
        y_values.append(convex_function(x))
    return x_values, y_values
```
A good gradient descent algorithm should satisfy the following conditions:
Efficiency:
It should need as few iterations as possible.
When far from the minimum, $\nabla{L}\gg0$;
when close to the minimum, $\nabla{L}\to0$;
such a gradient descent converges faster.
Conversely, an algorithm where $\nabla{L}\to0$ while still far from the minimum converges more slowly.
Stability:
The iteration process should be as stable as possible.
The Maxout activation function has very strong fitting capacity and can fit any convex function.
Robustness:
Gradient descent should not be sensitive
to the choice of initial values
or to particular cue fragments in the data.
The Dropout strategy
reduces complex co-adaptations between neurons (each neuron is deactivated with probability $p\%$),
forcing the network to learn more robust features, alleviating overfitting and improving generalization.
When necessary, adjust the learning rate so that gradient descent converges faster:
If the learning rate is too large, gradient descent may become unstable or even diverge.
If the learning rate is too small, convergence may be too slow.
Common learning-rate schedules include (a code sketch of the decay schedules follows this list):
Learning rate decay:
Step decay: $\eta_t=\frac{\eta}{\sqrt{t+1}}$.
Linear decay: $\eta_t=\eta(1-\frac{t}{T})$.
Exponential decay: $\eta_t=\eta{e^{-kt}}$.
Cosine decay: $\eta_t=\eta\frac{1+\cos(\frac{\pi{t}}{T})}{2}$.
Warm up learning rate: increase and then decrease learning rate.
SGD with momentum:
$w_{t+1}=w_t+v_{t+1}=w_t+\lambda{v_t}-\eta{g_t}$.
Adaptive learning rate:
AdaGrad:
adaptive sub-gradient method,
$w_{t+1}=w_t-\frac{\frac{\eta}{\sqrt{t+1}}}{\sqrt{\frac{1}{t+1}\sum_{i=0}^t{g_i^2}}}g_t=w_t-\frac{\eta}{\sqrt{\sum_{i=0}^t{g_i^2}}}g_t$.
RMSprop:
root mean square propagation,
$w_{t+1}=w_t-\frac{\eta}{\sigma_t}g_t=w_t-\frac{\eta}{\sqrt{\alpha\sigma_{t-1}^2+(1-\alpha)g_t^2}}g_t$.
Adam:
adaptive moment estimation (Momentum + RMSprop).
RAdam:
start with SGDM, then switch to Adam.
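A short Python sketch of the decay schedules listed above ($T$, $k$, and the base rate are arbitrary illustrative values):

```python
import math

eta, T, k = 0.1, 100, 0.05

def step_decay(t):        return eta / math.sqrt(t + 1)
def linear_decay(t):      return eta * (1 - t / T)
def exponential_decay(t): return eta * math.exp(-k * t)
def cosine_decay(t):      return eta * (1 + math.cos(math.pi * t / T)) / 2

for t in (0, 10, 50, 99):
    print(t, step_decay(t), linear_decay(t), exponential_decay(t), cosine_decay(t))
```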
When $\nabla{L}=0$ is encountered, the point may be one of the following:
A local minimum.
A local maximum.
A saddle point.
Expanding the loss with a Taylor series at such a point gives:
$$
\begin{split}
L(\theta)&\approx{L(\theta_0)}+\nabla{L(\theta_0)}(\theta-\theta_0)+\frac{1}{2}(\theta-\theta_0)^T\nabla^2{L(\theta_0)}(\theta-\theta_0)\\
&=L(\theta_0)+\frac{1}{2}(\theta-\theta_0)^T\nabla^2{L(\theta_0)}(\theta-\theta_0)
\end{split}
$$
where $\nabla^2{L(\theta_0)}$ is the Hessian, the matrix of second derivatives:
When $\nabla^2{L(\theta_0)}$ is positive definite, $\theta_0$ is a local minimum.
When $\nabla^2{L(\theta_0)}$ is negative definite, $\theta_0$ is a local maximum.
When $\nabla^2{L(\theta_0)}$ is indefinite, $\theta_0$ is a saddle point.
In high-dimensional spaces, saddle points far outnumber local minima.
Deep neural networks have so many parameters that their loss functions have far more saddle points than local minima.
Backpropagation:
Starting from minimizing the loss function, work backwards from the output layer to the input layer,
using the chain rule to compute the gradient of each layer and update the weights and biases.
Derivative chain rule:
$$
\begin{split}
\frac{\partial{L}}{\partial{w_{ij}^l}}
&=\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot
\frac{\partial{z_i^l}}{\partial{w_{ij}^l}} \\
&=\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot
\frac{\partial(X^{l-1}W_i^l+b_i^l)}{\partial{w_{ij}^l}} \\
&=\delta_i^l\cdot\sigma'(z_i^l)\cdot{x_j^{l-1}} \\
\frac{\partial{L}}{\partial{b_i^l}}
&=\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot
\frac{\partial{z_i^l}}{\partial{b_i^l}} \\
&=\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot
\frac{\partial(b_i^l+W_i^lX^{l-1})}{\partial{b_i^l}} \\
&=\delta_i^l\cdot\sigma'(z_i^l) \\
\frac{\partial{L}}{\partial{x_j^{l-1}}}
&=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot
\frac{\partial{z_i^l}}{\partial{x_j^{l-1}}} \\
&=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot
\frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot
\frac{\partial(W_i^lX^{l-1}+b_i^l)}{\partial{x_j^{l-1}}} \\
&=\sum\limits_{i=0}^{N_l-1}\delta_i^l\cdot\sigma'(z_i^l)\cdot{w_{ij}^l}
\end{split}
$$
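A compact NumPy sketch of these chain-rule formulas for a single layer with a sigmoid activation and a squared-error loss, checked against a numerical gradient (illustrative; the shapes and data are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # layer input X^{l-1}
W = rng.normal(size=(2, 3))   # weights W^l
b = rng.normal(size=2)        # biases b^l
y = rng.normal(size=2)        # target

def loss(W):
    return 0.5 * np.sum((sigmoid(W @ x + b) - y) ** 2)

# Backpropagation: delta = dL/dx^l, times sigma'(z^l), times x^{l-1}.
z = W @ x + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)   # dL/dz^l
dW = np.outer(delta, x)         # dL/dW^l
db = delta                      # dL/db^l

# Compare against a central-difference numerical gradient.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.allclose(dW, num, atol=1e-6)
```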