Machine Learning: Logistic Regression


$h(x)=W^Tx+b=\theta^Tx$

$$
h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2\tag{1}
$$

$$
h(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx\tag{2}
$$

$$x_0 =1 $$

$$
J(\theta)=\frac 1 2\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\tag{3}
$$

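Equation (3) is also known as ordinary least squares (OLS). It is very similar to the mean squared error (MSE); the differences are:

As a loss function, least squares does not divide by the number of samples m, whereas the mean squared error does.
"The method of solving for a model by minimizing the mean squared error is called the least squares method." (Zhou Zhihua, Machine Learning)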
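To make the distinction concrete, here is a minimal NumPy sketch that evaluates Equation (3) and the mean squared error on the same residuals. The toy data and the function names `least_squares_cost` and `mean_squared_error` are illustrative assumptions, not part of the original post.

```python
import numpy as np

def least_squares_cost(theta, X, y):
    """Equation (3): half the sum of squared residuals (no division by m)."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

def mean_squared_error(theta, X, y):
    """Mean squared error: squared residuals averaged over the m samples."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

# Toy data (assumed): the first column is x_0 = 1, the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = np.array([0.0, 1.0])  # predicts h(x) = x_1

j = least_squares_cost(theta, X, y)
mse = mean_squared_error(theta, X, y)
m = len(y)
print(j, mse, 2.0 * j / m)
```

The two quantities differ only by the constant factor 2/m, so they are minimized by the same $\theta$.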

As for why the loss function of linear regression is ordinary least squares, see the "Probabilistic interpretation" section of the PDF linked below.

Ordinary least squares

Gradient descent
Normal equation

$$
\theta=(X^TX)^{-1}X^T\vec{y}\tag{4}
$$
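As a rough illustration of Equation (4), the sketch below solves the normal equation on a tiny, made-up data set. It uses `np.linalg.solve` on $X^TX\theta=X^T\vec{y}$ instead of forming the inverse explicitly, and compares the result with `np.linalg.lstsq`; the data and variable names are assumptions for the example.

```python
import numpy as np

# Toy data (assumed): y = 1 + 2 * x_1 exactly, with x_0 = 1 as the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Equation (4), written as a linear solve: (X^T X) theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# General least-squares solver, which also copes with a singular X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal, theta_lstsq)
```

Both calls should recover an intercept close to 1 and a slope close to 2 for this data.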

See https://see.stanford.edu/mate…

$$
g(z)=\frac {1} {1+e^{-z}}\tag{5}
$$

$$
g'(z)=g(z)(1-g(z))\tag{6}
$$
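Equation (6) follows from applying the chain rule to Equation (5). The intermediate steps, which the text skips, can be written out as:

$$
\begin{equation}\begin{split}
g'(z)&=\frac{d}{dz}\left(1+e^{-z}\right)^{-1}\\
&=\frac{e^{-z}}{(1+e^{-z})^2}\\
&=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}\\
&=g(z)(1-g(z))
\end{split}\end{equation}
$$

The last step uses $1-g(z)=\frac{e^{-z}}{1+e^{-z}}$.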

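As shown in Fig. 1, the sigmoid function has domain R and range (0, 1). Its derivative reaches its maximum value of 0.25 at x = 0. As x tends to negative or positive infinity the derivative approaches 0, so the function saturates and there is a risk of vanishing gradients.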
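A quick numerical check of these properties, as a sketch: it evaluates Equations (5) and (6) on a grid and reports where the derivative peaks and how small it becomes far from zero. The helper names `sigmoid` and `sigmoid_grad` and the grid are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) from Equation (5)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative g'(z) = g(z) * (1 - g(z)) from Equation (6)."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-10.0, 10.0, 2001)  # grid with a point at (or extremely near) z = 0
values = sigmoid(z)
grads = sigmoid_grad(z)

peak_z = z[np.argmax(grads)]
print("derivative peaks near z =", peak_z, "with value", grads.max())
print("derivative at z = -10 and z = 10:", sigmoid_grad(-10.0), sigmoid_grad(10.0))
```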

$$
h_\theta(x)=g(\theta^Tx)=\frac {1} {1+e^{-\theta^Tx}}\tag{7}
$$

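Feeding the output of linear regression into the logistic function turns it into logistic regression. The range changes from R to (0, 1), which makes binary classification possible: outputs above 0.5 are assigned to one class and outputs below 0.5 to the other.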
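The decision rule can be sketched in a few lines. The parameter vector, the inputs, and the function names `predict_proba` and `predict_label` are made up for illustration; assigning an output of exactly 0.5 to class 0 is also an assumption, since the text leaves that case open.

```python
import numpy as np

def predict_proba(theta, X):
    """Equation (7): h_theta(x) = g(theta^T x) for every row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def predict_label(theta, X, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return (predict_proba(theta, X) > threshold).astype(int)

theta = np.array([-3.0, 1.0])  # assumed parameters; theta_0 multiplies x_0 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])

probs = predict_proba(theta, X)
labels = predict_label(theta, X)
print(probs, labels)
```

Because $g(z)>0.5$ exactly when $z>0$, thresholding the probability at 0.5 is the same as checking the sign of $\theta^Tx$.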

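Next, we derive the loss function from the perspective of maximum likelihood estimation (MLE).
In general, with MLE we want $\arg\max_\theta\, p(D|\theta)$, that is, the parameters under which the observed data are most probable.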

$$
P(y=1|x;\theta)=h_\theta(x)\tag{8}
$$

$$
P(y=0|x;\theta)=1-h_\theta(x)\tag{9}
$$

Equations (8) and (9) can be combined into Equation (10):

$$
P(y|x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y}\tag{10}
$$

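Equation (10) is the likelihood of a single sample. How, then, do we express the likelihood of all samples? Assuming the samples are independent, it is the product of the per-sample likelihoods, as in Equation (11):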

$$
\begin{equation}\begin{split}
L(\theta)&=p(\vec{y}|X;\theta)\\
&=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)\\
&=\prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\tag{11}
\end{split}\end{equation}
$$

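This is in fact our objective function. To find $\arg\max_\theta L(\theta)$, however, we simplify it further by taking the logarithm, $\log L(\theta)$. Doing so has three benefits:

It simplifies the expression by turning the product into a sum.
It avoids the risk of numerical underflow during computation.
log is monotonically increasing, so it does not change where the maximum is attained.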
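The underflow point is easy to reproduce. In the sketch below, the array `probs` is an assumed stand-in for the per-sample likelihoods $p(y^{(i)}|x^{(i)};\theta)$: their product collapses to zero in double precision, while the sum of their logarithms stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed stand-in for per-sample likelihoods, each drawn from (0.1, 0.9).
probs = rng.uniform(0.1, 0.9, size=2000)

likelihood = np.prod(probs)              # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(probs))   # the same quantity on the log scale

print("product:", likelihood)
print("sum of logs:", log_likelihood)
```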

$$
\begin{equation}\begin{split}
\ell(\theta)&=\log L(\theta)\\
&=\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{12}
\end{split}\end{equation}
$$

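Equation (12) is the log-likelihood we are looking for. Putting a minus sign in front of it turns it into the loss function: maximum likelihood seeks a maximum, whereas a loss function seeks a minimum.

To keep the differentiation simple, the sum is left out here, so the derivative is taken for a single sample only: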

$$
\begin{equation}\begin{split}
\frac{\partial \ell(\theta)}{\partial \theta_j}&=(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)})\frac{\partial g(\theta^Tx)}{\partial \theta_j}\\
&=(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)})g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial \theta^Tx}{\partial \theta_j}\\
&=(y(1-g(\theta^Tx))-(1-y)g(\theta^Tx))x_j\\
&=(y-h_\theta(x))x_j\tag{13}
\end{split}\end{equation}
$$
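One way to sanity-check Equation (13) is to compare it with a finite-difference approximation of the single-sample log-likelihood. The sketch below does this for an arbitrary, made-up $\theta$, $x$ and $y$; the function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_one(theta, x, y):
    """Single-sample term of Equation (12)."""
    h = sigmoid(theta @ x)
    return y * np.log(h) + (1.0 - y) * np.log(1.0 - h)

def analytic_grad(theta, x, y):
    """Equation (13): (y - h_theta(x)) * x_j for every j."""
    return (y - sigmoid(theta @ x)) * x

def numeric_grad(theta, x, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (log_likelihood_one(theta + step, x, y)
                   - log_likelihood_one(theta - step, x, y)) / (2.0 * eps)
    return grad

theta = np.array([0.5, -1.0, 2.0])  # assumed values
x = np.array([1.0, 0.3, -0.7])      # x_0 = 1
y = 1.0

max_diff = np.max(np.abs(analytic_grad(theta, x, y) - numeric_grad(theta, x, y)))
print("largest difference:", max_diff)
```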

So the gradient ascent rule for a single sample $(x^{(i)}, y^{(i)})$ is:

$$
\theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\tag{14}
$$
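A minimal sketch of this update applied sample by sample (stochastic gradient ascent) might look as follows. The toy data, the learning rate, the number of passes, and the names `sga_epoch` and `log_likelihood` are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Equation (12) summed over all samples."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def sga_epoch(theta, X, y, alpha):
    """One pass over the data, applying Equation (14) once per sample."""
    for x_i, y_i in zip(X, y):
        theta = theta + alpha * (y_i - sigmoid(theta @ x_i)) * x_i
    return theta

# Toy data (assumed): x_0 = 1 plus one feature. The classes overlap,
# so the maximum-likelihood estimate is finite.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
ll_start = log_likelihood(theta, X, y)
for _ in range(200):
    theta = sga_epoch(theta, X, y, alpha=0.1)
ll_end = log_likelihood(theta, X, y)

print("log-likelihood before and after:", ll_start, ll_end)
```

With a constant step size the parameters keep jittering around the optimum rather than settling exactly on it, which is why practical implementations often decay $\alpha$ over time.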

As mentioned above, negating Equation (12) gives our loss function:

$$
\arg\min_\theta\;-\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{15}
$$

The gradient descent rule over all samples follows by negating the gradient in Equation (13) and summing over the samples:

$$
\theta_j:=\theta_j-\alpha\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\tag{16}
$$

As we can see in the figure above, we penalize wrong predictions with an increasingly larger cost.
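The same effect can be seen numerically. The sketch below evaluates the per-sample term of Equation (15) for a true label of 1 and a handful of assumed predictions, from confidently right to confidently wrong.

```python
import numpy as np

def cross_entropy_one(h, y):
    """Per-sample loss from Equation (15): -[y log h + (1 - y) log(1 - h)]."""
    return -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# True label y = 1; predictions range from confident-right to confident-wrong.
predictions = np.array([0.99, 0.9, 0.5, 0.1, 0.01])
costs = cross_entropy_one(predictions, 1.0)

for h, c in zip(predictions, costs):
    print(f"h = {h:4.2f}  cost = {c:.3f}")
```

The cost grows without bound as the predicted probability of the true class approaches 0.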

Gradient descent code

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, nx, ny, m, num_iterations, alpha):
    """Train W and b by batch gradient descent.

    X has shape (nx, m) and Y has shape (1, m); ny is not used in the body.
    """
    W = np.zeros(shape=(nx, 1), dtype=np.float32)  # weights initialization
    b = 0.0  # bias initialization
    for i in range(num_iterations):
        Z = np.dot(W.T, X) + b  # shape: (1, m)
        A = sigmoid(Z)  # shape: (1, m)

        if i % 1000 == 0:
            # Two equivalent ways to compute the cross-entropy cost, i.e. the
            # negative log-likelihood of Equation (15) averaged over m samples.
            cost_1 = -1.0 / m * (np.dot(Y, np.log(A).T) + np.dot((1-Y), np.log(1-A).T))
            print('cost_1:{}'.format(np.squeeze(cost_1)))
            cost_2 = -1.0 / m * np.sum(np.multiply(Y, np.log(A)) + np.multiply(1-Y, np.log(1-A)))
            print('cost_2:{}'.format(cost_2))

        # Backward pass through the computation graph (1/m is applied in the updates)
        dZ = A - Y  # derivative of the summed cost w.r.t. Z, shape: (1, m)
        dW = np.dot(X, dZ.T)  # derivative w.r.t. W, shape: (nx, 1)
        W -= 1.0 / m * alpha * dW  # update W
        db = np.sum(dZ)  # derivative w.r.t. b
        b -= 1.0 / m * alpha * db  # update b
    return W, b
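As a usage sketch, the block below restates the same training loop compactly under the hypothetical name `train_logistic` (so that it runs on its own) and fits it to synthetic data. The two Gaussian blobs, the random seed, the learning rate and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, Y, num_iterations=5000, alpha=0.1):
    """Batch gradient descent with X of shape (nx, m) and Y of shape (1, m)."""
    nx, m = X.shape
    W = np.zeros((nx, 1))
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(W.T @ X + b)        # predictions, shape (1, m)
        dZ = A - Y                      # derivative of the summed cost w.r.t. Z
        W -= alpha / m * (X @ dZ.T)     # shape (nx, 1)
        b -= alpha / m * np.sum(dZ)
    return W, b

# Synthetic data (assumed): two Gaussian blobs, one per class.
rng = np.random.default_rng(42)
m_half = 100
X = np.hstack([rng.normal(-1.0, 1.0, size=(2, m_half)),
               rng.normal(1.0, 1.0, size=(2, m_half))])
Y = np.hstack([np.zeros((1, m_half)), np.ones((1, m_half))])

W, b = train_logistic(X, Y)
predictions = (sigmoid(W.T @ X + b) > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
print(f"training accuracy: {accuracy:.2f}")
```

The blobs overlap, so the training accuracy should be high but not perfect.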


Posted in: Machine Learning

2022-02-14


Machine Learning: Logistic Regression

Pre learn

Objective function, cost function, loss function: are they the same thing?

The meaning of theta

Linear regression

Hypotheses:

Loss:

Goal:

Solutions:

Logistic regression

Logistic function:

Hypotheses:

MLE(Maximum Likelihood Estimate)

Assume:

Likelihood:

Log Likelihood

Gradient ascent for one sample:

Loss function
