Background needed for this article: Logistic Regression Algorithm (Part 1), the conjugate gradient method, and a little programming knowledge.
1. Introduction
This article follows the previous one, Logistic Regression Algorithm (Part 1), which introduced two ways to optimize the logistic regression cost function: gradient descent and Newton's method. When you use a third-party machine learning library, however, you will find that these two methods are rarely used directly in their plain form; instead, optimized versions or variants of the algorithms are used. For example, the solvers available in scikit-learn, mentioned earlier, are listed in the table below:
| Solver | Algorithm |
| --- | --- |
| sag | Stochastic Average Gradient descent (SAG) |
| saga | SAGA (a variance-reduced variant of SAG) |
| lbfgs | Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) |
| newton-cg | Newton–Conjugate Gradient |
The sections below introduce these algorithms one by one and explain why third-party libraries generally do not use plain gradient descent or Newton's method directly: what shortcomings do these two original algorithms have? Since the author's ability is limited, only the iteration formulas are given; their derivations cannot be worked out in detail here, and interested readers may refer to the proofs in the corresponding papers.
2. Gradient Descent
Gradient Descent (GD)
The original gradient descent algorithm, also called batch gradient descent (BGD), uses the entire data set as input to compute the gradient.
$$
w=w-\eta \nabla_{w} \operatorname{Cost}(w)
$$
The main drawback of this algorithm is that it uses the entire data set: when the data set is very large, computing the gradient can be extremely time-consuming.
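For reference, here is a minimal batch gradient descent sketch for logistic regression. It is not among the implementations given later in this article; it assumes a gradient helper dcost(X, y, w) for the logistic cost, a sketch of which is given at the beginning of the code implementation section below.

import numpy as np

def logisticRegressionBGD(X, y, max_iter=100, tol=1e-4, step=1e-1):
    # Batch gradient descent: every update uses the gradient over the full data set
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        d = dcost(X, y, w)
        # Stop once the gradient is small enough
        if np.linalg.norm(d) <= tol:
            break
        w = w - step * d
    return w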
Stochastic Gradient Descent (SGD)
Each iteration updates the weights with a single randomly chosen sample instead of the entire data set.
$$
w=w-\eta \nabla_{w} \operatorname{Cost}\left(w, X_{i}, y_{i}\right) \quad i \in[1, N]
$$
Because each step uses a single random data point, the cost function does not decrease monotonically but fluctuates up and down; the step size has to be tuned so that the cost trends downward overall, so the convergence rate is slower than that of batch gradient descent.
Mini-batch Gradient Descent (MBGD)
Mini-batch gradient descent combines the two algorithms above: when computing the gradient it uses neither the whole data set nor a single random sample, but a small batch of data for each update.
$$
w=w-\eta \nabla_{w} \operatorname{Cost}\left(w, X_{(i: i+k)}, y_{(i: i+k)}\right) \quad i \in[1, N]
$$
Stochastic Average Gradient (SAG) [1]
SAG is an improvement on stochastic gradient descent: because of SGD's randomness, its convergence is relatively slow. SAG keeps a record of the most recently computed gradient of each sample, so every update can make use of more information. In the update below, $i_{k}$ is the index sampled at iteration $k$ and $\left(d_{i}\right)_{k}$ is the stored gradient of sample $i$ after that iteration.
$$
\begin{aligned}
w_{k+1} &=w_{k}-\frac{\eta}{N} \sum_{i=1}^{N}\left(d_{i}\right)_{k} \\
\left(d_{i}\right)_{k} &=\left\{\begin{array}{ll}
\nabla_{w} \operatorname{Cost}\left(w_{k}, X_{i}, y_{i}\right) & i=i_{k} \\
\left(d_{i}\right)_{k-1} & i \neq i_{k}
\end{array}\right.
\end{aligned}
$$
Stochastic Variance Reduced Gradient (SVRG) [2]
SVRG is another improvement on stochastic gradient descent. SGD's convergence problem stems from the fact that the variance of the stochastic gradient is only assumed to have a constant upper bound; SVRG reduces this variance so that the convergence process becomes more stable. In the update below, $\hat{w}$ is a snapshot of the weights that is stored and refreshed periodically.
$$
w_{k+1}=w_{k}-\eta\left(\nabla_{w} \operatorname{Cost}\left(w_{k}, X_{i}, y_{i}\right)-\nabla_{w} \operatorname{Cost}\left(\hat{w}, X_{i}, y_{i}\right)+\frac{1}{N} \sum_{j=1}^{N} \nabla_{w} \operatorname{Cost}\left(\hat{w}, X_{j}, y_{j}\right)\right)
$$
SAGA [3] (a variant of stochastic average gradient)
SAGA improves on the stochastic average gradient method by incorporating the variance-reduction idea of SVRG. In the update below, $\hat{w}_{j}$ denotes the point at which the gradient of sample $j$ was last evaluated (i.e., the stored gradient table):
$$
w_{k+1}=w_{k}-\eta\left(\nabla_{w} \operatorname{Cost}\left(w_{k}, X_{i}, y_{i}\right)-\nabla_{w} \operatorname{Cost}\left(\hat{w}_{i}, X_{i}, y_{i}\right)+\frac{1}{N} \sum_{j=1}^{N} \nabla_{w} \operatorname{Cost}\left(\hat{w}_{j}, X_{j}, y_{j}\right)\right)
$$
3. Newton's Method
Newton's Method
The original version of Newton's method uses the entire data set as input, computes the gradient and the Hessian matrix, and then obtains the descent direction:
$$
w_{k+1}=w_{k}-\eta\left(H^{-1} \nabla_{w} \operatorname{Cost}\left(w_{k}\right)\right)
$$
The main drawback of this algorithm is that it needs the Hessian matrix and its inverse; when x has many dimensions, computing the Hessian becomes extremely difficult and expensive.
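For comparison with the quasi-Newton methods that follow, here is a minimal sketch of the plain Newton iteration. It is not part of the original implementations and assumes the gradient and Hessian helpers dcost and ddcost sketched at the beginning of the code implementation section below.

import numpy as np

def logisticRegressionNewton(X, y, max_iter=100, tol=1e-4, step=1.0):
    # Plain Newton's method: solve H * delta = g instead of forming H^{-1} explicitly
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        g = dcost(X, y, w)   # gradient
        if np.linalg.norm(g) <= tol:
            break
        H = ddcost(X, y, w)  # Hessian matrix
        delta = np.linalg.solve(H, g)
        w = w - step * delta
    return w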
DFP [4] (Davidon–Fletcher–Powell)
DFP is a quasi-Newton method; the algorithm is as follows:
$$
\begin{aligned}
g_{k} &=\nabla_{w} \operatorname{Cost}\left(w_{k}\right) \\
d_{k} &=-D_{k} g_{k} \\
s_{k} &=\eta d_{k} \\
w_{k+1} &=w_{k}+s_{k} \\
g_{k+1} &=\nabla_{w} \operatorname{Cost}\left(w_{k+1}\right) \\
y_{k} &=g_{k+1}-g_{k} \\
D_{k+1} &=D_{k}+\frac{s_{k} s_{k}^{T}}{s_{k}^{T} y_{k}}-\frac{D_{k} y_{k} y_{k}^{T} D_{k}}{y_{k}^{T} D_{k} y_{k}}
\end{aligned}
$$
As you can see, DFP no longer computes the Hessian matrix directly; it builds up an approximation iteration by iteration, where $D$ is an approximation of the inverse of the Hessian.
BFGS [5] (Broyden–Fletcher–Goldfarb–Shanno)
BFGS is also a quasi-Newton method, and its basic steps are almost the same as those of DFP; the algorithm is as follows:
$$
\begin{aligned}
g_{k} &=\nabla_{w} \operatorname{Cost}\left(w_{k}\right) \\
d_{k} &=-D_{k} g_{k} \\
s_{k} &=\eta d_{k} \\
w_{k+1} &=w_{k}+s_{k} \\
g_{k+1} &=\nabla_{w} \operatorname{Cost}\left(w_{k+1}\right) \\
y_{k} &=g_{k+1}-g_{k} \\
D_{k+1} &=\left(I-\frac{s_{k} y_{k}^{T}}{y_{k}^{T} s_{k}}\right) D_{k}\left(I-\frac{y_{k} s_{k}^{T}}{y_{k}^{T} s_{k}}\right)+\frac{s_{k} s_{k}^{T}}{y_{k}^{T} s_{k}}
\end{aligned}
$$
As you can see, the only difference in BFGS is the way the Hessian matrix is approximated; $D$ is again an approximation of the inverse of the Hessian.
L-BFGS [6] (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
Because BFGS needs to store an approximate Hessian matrix, the memory it occupies becomes extremely large when x has many dimensions. L-BFGS approximates BFGS one step further; the algorithm is as follows:
$$
\begin{aligned}
g_{k} &=\nabla_{w} \operatorname{Cost}\left(w_{k}\right) \\
d_{k} &=-\operatorname{calcDirection}\left(s_{k-m: k-1}, y_{k-m: k-1}, \rho_{k-m: k-1}, g_{k}\right) \\
s_{k} &=\eta d_{k} \\
w_{k+1} &=w_{k}+s_{k} \\
g_{k+1} &=\nabla_{w} \operatorname{Cost}\left(w_{k+1}\right) \\
y_{k} &=g_{k+1}-g_{k} \\
\rho_{k} &=\frac{1}{y_{k}^{T} s_{k}}
\end{aligned}
$$
As you can see, L-BFGS no longer keeps the approximate Hessian matrix itself; when the search direction is needed, it is computed directly from a small set of stored vectors, which saves memory. The direction computation can be seen in the code below (the two-loop recursion in calcDirection).
Newton-Conjugate Gradient (Newton-CG)
Newton-CG is an improvement on Newton's method; the algorithm is as follows:
$$
\begin{aligned}
g_{k} &=\nabla_{w} \operatorname{Cost}\left(w_{k}\right) \\
H_{k} &=\nabla_{w}^{2} \operatorname{Cost}\left(w_{k}\right) \\
\Delta w &=\operatorname{cg}\left(g_{k}, H_{k}\right) \\
w_{k+1} &=w_{k}-\Delta w
\end{aligned}
$$
As you can see, Newton-CG no longer inverts the Hessian matrix; instead it solves for $\Delta w$ directly with the conjugate gradient (CG) method. For the conjugate gradient method, the article in reference [7] is recommended; it explains the algorithm's principles and applications in detail.
4. Code Implementation
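The implementations below call gradient and Hessian helpers dcost and ddcost that are not shown in the original and were presumably defined in Part 1. The following is a minimal sketch under that assumption (together with the cost itself, which is used by the line-search sketch later), assuming X already contains a bias column if an intercept is wanted:

import numpy as np

def sigmoid(z):
    # Logistic function
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w):
    # Average negative log-likelihood of logistic regression
    p = sigmoid(np.dot(X, w))
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def dcost(X, y, w):
    # Gradient of the cost; also accepts a single sample (1-D X, scalar y)
    X = np.atleast_2d(X)
    y = np.atleast_1d(y)
    p = sigmoid(np.dot(X, w))
    return np.dot(X.T, p - y) / X.shape[0]

def ddcost(X, y, w):
    # Hessian of the cost: X^T diag(p * (1 - p)) X / N
    p = sigmoid(np.dot(X, w))
    return np.dot(X.T * (p * (1 - p)), X) / X.shape[0]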
Logistic regression in Python (stochastic gradient descent):
import numpy as np

def logisticRegressionSGD(X, y, max_iter=100, tol=1e-4, step=1e-1):
    w = np.zeros(X.shape[1])
    xy = np.c_[X.reshape(X.shape[0], -1), y.reshape(X.shape[0], 1)]
    for it in range(max_iter):
        s = step / (np.sqrt(it + 1))
        np.random.shuffle(xy)
        X_new, y_new = xy[:, :-1], xy[:, -1:].ravel()
        for i in range(0, X.shape[0]):
            d = dcost(X_new[i], y_new[i], w)
            if np.linalg.norm(d) <= tol:
                return w
            w = w - s * d
    return w
Logistic regression in Python (mini-batch gradient descent):
import numpy as np

def logisticRegressionMBGD(X, y, batch_size=50, max_iter=100, tol=1e-4, step=1e-1):
    w = np.zeros(X.shape[1])
    xy = np.c_[X.reshape(X.shape[0], -1), y.reshape(X.shape[0], 1)]
    for it in range(max_iter):
        s = step / (np.sqrt(it + 1))
        np.random.shuffle(xy)
        for start in range(0, X.shape[0], batch_size):
            stop = start + batch_size
            X_batch, y_batch = xy[start:stop, :-1], xy[start:stop, -1:].ravel()
            d = dcost(X_batch, y_batch, w)
            if np.linalg.norm(d) <= tol:
                return w
            w = w - s * d
    return w
Logistic regression in Python (stochastic average gradient descent):
import numpy as np

def logisticRegressionSAG(X, y, max_iter=100, tol=1e-4, step=1e-1):
    w = np.zeros(X.shape[1])
    p = np.zeros(X.shape[1])
    d_prev = np.zeros(X.shape)
    for it in range(max_iter):
        s = step / (np.sqrt(it + 1))
        for _ in range(X.shape[0]):
            i = np.random.randint(0, X.shape[0])
            d = dcost(X[i], y[i], w)
            p = p - d_prev[i] + d
            d_prev[i] = d
            p_avg = p / X.shape[0]
            if np.linalg.norm(p_avg) <= tol:
                return w
            w = w - s * p_avg
    return w
Logistic regression in Python (stochastic variance reduced gradient descent):
import numpy as np

def logisticRegressionSVRG(X, y, max_iter=100, m=100, tol=1e-4, step=1e-1):
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        s = step / (np.sqrt(it + 1))
        g = np.zeros(X.shape[1])
        for i in range(X.shape[0]):
            g = g + dcost(X[i], y[i], w)
        g = g / X.shape[0]
        tempw = w
        for _ in range(m):
            i = np.random.randint(0, X.shape[0])
            d_tempw = dcost(X[i], y[i], tempw)
            d_w = dcost(X[i], y[i], w)
            d = d_tempw - d_w + g
            if np.linalg.norm(d) <= tol:
                break
            tempw = tempw - s * d
        w = tempw
    return w
Logistic regression in Python (SAGA):
import numpy as np

def logisticRegressionSAGA(X, y, max_iter=100, tol=1e-4, step=1e-1):
    w = np.zeros(X.shape[1])
    p = np.zeros(X.shape[1])
    d_prev = np.zeros(X.shape)
    for i in range(X.shape[0]):
        d_prev[i] = dcost(X[i], y[i], w)
    for it in range(max_iter):
        s = step / (np.sqrt(it + 1))
        for _ in range(X.shape[0]):
            i = np.random.randint(0, X.shape[0])
            d = dcost(X[i], y[i], w)
            p = d - d_prev[i] + np.mean(d_prev, axis=0)
            d_prev[i] = d
            if np.linalg.norm(p) <= tol:
                return w
            w = w - s * p
    return w
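The DFP, BFGS, and L-BFGS implementations below also call a lineSearch(X, y, w, d, a, b) helper that is not shown in the original. A minimal sketch, assuming it simply looks for the step size in the interval (a, b] that approximately minimizes the cost along the direction d:

import numpy as np

def lineSearch(X, y, w, d, a, b, num=100):
    # Crude line search: evaluate the cost on a grid of candidate step sizes
    # in (a, b] and return the one with the smallest cost
    steps = np.linspace(a, b, num + 1)[1:]
    costs = [cost(X, y, w + s * d) for s in steps]
    return steps[int(np.argmin(costs))]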
Logistic regression in Python (DFP):
import numpy as np

def logisticRegressionDFP(X, y, max_iter=100, tol=1e-4):
    w = np.zeros(X.shape[1])
    D_k = np.eye(X.shape[1])
    g_k = dcost(X, y, w)
    for it in range(max_iter):
        d_k = -D_k.dot(g_k)
        s = lineSearch(X, y, w, d_k, 0, 10)
        s_k = s * d_k
        w = w + s_k
        g_k_1 = dcost(X, y, w)
        if np.linalg.norm(g_k_1) <= tol:
            return w
        y_k = (g_k_1 - g_k).reshape(-1, 1)
        s_k = s_k.reshape(-1, 1)
        D_k = D_k + s_k.dot(s_k.T) / s_k.T.dot(y_k) - D_k.dot(y_k).dot(y_k.T).dot(D_k) / y_k.T.dot(D_k).dot(y_k)
        g_k = g_k_1
    return w
Logistic regression in Python (BFGS):
import numpy as np

def logisticRegressionBFGS(X, y, max_iter=100, tol=1e-4):
    w = np.zeros(X.shape[1])
    D_k = np.eye(X.shape[1])
    g_k = dcost(X, y, w)
    for it in range(max_iter):
        d_k = -D_k.dot(g_k)
        s = lineSearch(X, y, w, d_k, 0, 10)
        s_k = s * d_k
        w = w + s_k
        g_k_1 = dcost(X, y, w)
        if np.linalg.norm(g_k_1) <= tol:
            return w
        y_k = (g_k_1 - g_k).reshape(-1, 1)
        s_k = s_k.reshape(-1, 1)
        a = s_k.dot(y_k.T)
        b = y_k.T.dot(s_k)
        c = s_k.dot(s_k.T)
        D_k = (np.eye(X.shape[1]) - a / b).dot(D_k).dot(np.eye(X.shape[1]) - a.T / b) + c / b
        g_k = g_k_1
    return w
Logistic regression in Python (L-BFGS):
import numpy as np

def calcDirection(ss, ys, rhos, g_k, m, k):
    delta = 0
    L = k
    q = g_k.reshape(-1, 1)
    if k > m:
        delta = k - m
        L = m
    alphas = np.zeros(L)
    for i in range(L - 1, -1, -1):
        j = i + delta
        alpha = rhos[j].dot(ss[j].T).dot(q)
        alphas[i] = alpha
        q = q - alpha * ys[j]
    r = np.eye(g_k.shape[0]).dot(q)
    for i in range(0, L):
        j = i + delta
        beta = rhos[j].dot(ys[j].T).dot(r)
        r = r + (alphas[i] - beta) * ss[j]
    return -r.ravel()

def logisticRegressionLBFGS(X, y, m=100, max_iter=100, tol=1e-4):
    w = np.zeros(X.shape[1])
    g_k = dcost(X, y, w)
    d_k = -np.eye(X.shape[1]).dot(g_k)
    ss = []
    ys = []
    rhos = []
    for it in range(max_iter):
        d_k = calcDirection(ss, ys, rhos, g_k, m, it)
        s = lineSearch(X, y, w, d_k, 0, 1)
        s_k = s * d_k
        w = w + s_k
        g_k_1 = dcost(X, y, w)
        if np.linalg.norm(g_k_1) <= tol:
            return w
        y_k = (g_k_1 - g_k).reshape(-1, 1)
        s_k = s_k.reshape(-1, 1)
        ss.append(s_k)
        ys.append(y_k)
        rhos.append(1 / (y_k.T.dot(s_k)))
        g_k = g_k_1
    return w
Logistic regression in Python (Newton-Conjugate Gradient):
import numpy as np

def cg(H, g, max_iter=100, tol=1e-4):
    """
    Conjugate gradient method: solve H * deltaw = g for deltaw
    """
    deltaw = np.zeros(g.shape[0])
    i = 0
    r = g
    d = r
    delta = np.dot(r, r)
    delta_0 = delta
    while i < max_iter:
        q = H.dot(d)
        alpha = delta / (np.dot(d, q))
        deltaw = deltaw + alpha * d
        r = r - alpha * q
        delta_prev = delta
        delta = np.dot(r, r)
        if delta <= tol * tol * delta_0:
            break
        beta = delta / delta_prev
        d = r + beta * d
        i = i + 1
    return deltaw

def logisticRegressionNewtonCG(X, y, max_iter=100, tol=1e-4, step=1.0):
    """
    Logistic regression using the Newton-Conjugate Gradient method
    args:
        X - training data set
        y - target label values
        max_iter - maximum number of iterations
        tol - tolerance
    return:
        w - weight coefficients
    """
    # Initialize w as a zero vector
    w = np.zeros(X.shape[1])
    # Start iterating
    for it in range(max_iter):
        # Compute the gradient
        d = dcost(X, y, w)
        # Stop when the gradient is small enough
        if np.linalg.norm(d) <= tol:
            break
        # Compute the Hessian matrix
        H = ddcost(X, y, w)
        # Use the conjugate gradient method to compute Δw
        deltaw = cg(H, d)
        w = w - step * deltaw
    return w
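A small end-to-end usage sketch with synthetic data (the data, the added bias column, and the choice of solvers are illustrative assumptions, not part of the original):

import numpy as np

np.random.seed(42)
# 200 synthetic samples with 2 features; the labels follow a linear rule
N = 200
X_raw = np.random.randn(N, 2)
y = (X_raw[:, 0] + 2 * X_raw[:, 1] > 0).astype(float)
# Prepend a column of ones so the first weight acts as the intercept
X = np.c_[np.ones(N), X_raw]
# Fit with two of the solvers implemented above and compare the learned weights
w_sgd = logisticRegressionSGD(X, y)
w_ncg = logisticRegressionNewtonCG(X, y)
print("SGD weights:      ", w_sgd)
print("Newton-CG weights:", w_ncg)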
5. Third-Party Library Implementation
Logistic regression with scikit-learn [8] (stochastic average gradient descent):
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression estimator, no regularization
reg = LogisticRegression(penalty="none", solver="sag")
# Fit the model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
# Intercept
b = reg.intercept_
Logistic regression with scikit-learn [8] (SAGA):
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression estimator, no regularization
reg = LogisticRegression(penalty="none", solver="saga")
# Fit the model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
# Intercept
b = reg.intercept_
Logistic regression with scikit-learn [8] (L-BFGS):
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression estimator, no regularization
reg = LogisticRegression(penalty="none", solver="lbfgs")
# Fit the model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
# Intercept
b = reg.intercept_
Logistic regression with scikit-learn [8] (Newton-Conjugate Gradient):
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression estimator, no regularization
reg = LogisticRegression(penalty="none", solver="newton-cg")
# Fit the model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
# Intercept
b = reg.intercept_
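Note that in newer scikit-learn releases (1.2 and later) the string penalty="none" is deprecated in favor of penalty=None; the snippets above use the older spelling.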
7. Mind Map
8. References
1. https://hal.inria.fr/hal-0086…
2. https://harkiratbehl.github.i…
3. https://arxiv.org/pdf/1407.02…
4. https://en.wikipedia.org/wiki…
5. https://en.wikipedia.org/wiki…
6. https://en.wikipedia.org/wiki…
7. https://flat2010.github.io/20…
8. https://scikit-learn.org/stab…
For the complete demo, please click here.
Note: this article strives to be accurate and easy to understand, but since the author is also a beginner with limited ability, readers are kindly asked to point out any errors or omissions in the comments.
This article was first published on AI 导图; you are welcome to follow it.