Prerequisites
Objective function, cost function, loss function: are they the same thing?
The meaning of theta
\( h(x)=W^Tx+b=\theta^Tx \)
Linear regression
Hypotheses:
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2\tag{1}$$
$$h(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx\tag{2}$$
$$x_0 =1 $$
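As a rough illustration of Equation (2), the sketch below folds the intercept into \( \theta \) by prepending a constant feature \( x_0=1 \). The helper names `add_intercept` and `h_linear`, and the toy numbers, are assumptions made for this example only.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so that x_0 = 1 for every sample (row)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def h_linear(theta, X):
    """Linear hypothesis h(x) = theta^T x, evaluated for every row of X."""
    return add_intercept(X) @ theta

# Hypothetical values: two features, theta = (theta_0, theta_1, theta_2)
theta = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 1.0],
              [0.0, 2.0]])
predictions = h_linear(theta, X)  # 1 + 2*1 + 3*1 and 1 + 2*0 + 3*2
```

With the intercept folded in, the bias \( b \) is simply \( \theta_0 \), which is why \( W^Tx+b \) and \( \theta^Tx \) describe the same function.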
Loss:
$$J(\theta)=\frac 1 2\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\tag{3}$$
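Equation (3) is also known as ordinary least squares (OLS). It is very similar to the mean squared error (MSE); the differences are:
- As a loss function, least squares is not divided by the number of samples m, whereas the mean squared error is.
- "The method of solving for a model by minimizing the mean squared error is called the least squares method." (Zhou Zhihua, Machine Learning)
As for why the loss function of linear regression is ordinary least squares, see the "Probabilistic interpretation" section of the PDF linked below.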
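To make the relationship between the two quantities concrete, here is a small sketch that computes both on the same data. The function names and the toy data are assumptions for illustration; the design matrix is assumed to already contain the \( x_0=1 \) column.

```python
import numpy as np

def ols_loss(theta, X, y):
    """Equation (3): half the sum of squared residuals, not divided by m."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

def mse(theta, X, y):
    """Mean squared error: squared residuals averaged over the m samples."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

# Hypothetical toy data; the first column of X is x_0 = 1
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
theta = np.array([1.0, 1.0])
m = len(y)

j = ols_loss(theta, X, y)
e = mse(theta, X, y)
```

The two values differ only by the constant factor \( m/2 \), so they are minimized by the same \( \theta \).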
Goal:
- Ordinary least squares
Solutions:
- Gradient descent
- Normal equation
$$\theta=(X^TX)^{-1}X^T\vec{y}\tag{4}$$
For details, see https://see.stanford.edu/mate...
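A minimal sketch of the normal equation (4), assuming \( X^TX \) is invertible. It solves the linear system \( X^TX\theta=X^T\vec{y} \) with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more stable choice. The function name and the data are made up for the example.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, which is equivalent to Equation (4)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2x, so the fit should recover (1, 2)
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # first column is x_0 = 1
y = 1.0 + 2.0 * x

theta = normal_equation(X, y)
```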
Logistic regression
Logistic function:
$$g(z)=\frac {1} {1+e^{-z}}\tag{5}$$
$$g'(z)=g(z)(1-g(z))\tag{6}$$
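As shown in Fig. 1 above, the sigmoid function has domain R and range (0, 1). Its derivative reaches its maximum of 0.25 at z = 0. As z tends to negative or positive infinity the derivative approaches 0, so there is a risk of vanishing gradients.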
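The sketch below checks Equation (6) numerically against a central finite difference, and shows how small the derivative becomes away from zero. The sample points and the step size `eps` are arbitrary choices for the illustration.

```python
import numpy as np

def sigmoid(z):
    """Equation (5): g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Equation (6): g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.array([-10.0, 0.0, 10.0])
grads = sigmoid_grad(z)

# Central finite difference as an independent check of Equation (6)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```

At z = 0 the derivative is exactly 0.25, while at z = ±10 it is already below 1e-4.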
Hypotheses:
$$h_\theta(x)=g(\theta^Tx)=\frac {1} {1+e^{-\theta^Tx}}\tag{7}$$
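Feeding the output of linear regression into the logistic function gives logistic regression. The range changes from R to (0, 1), which makes binary classification possible: outputs above 0.5 belong to one class and outputs below 0.5 to the other.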
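A small sketch of this decision rule, assuming the design matrix already contains the \( x_0=1 \) column. The name `predict`, the `threshold` parameter, and the numbers are hypothetical.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Return class 1 where h_theta(x) > threshold, otherwise class 0."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs > threshold).astype(int)

# Hypothetical parameters and samples; theta^T x is -1 and +1 respectively
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
labels = predict(theta, X)
```

Because \( g(z)>0.5 \) exactly when \( z>0 \), thresholding the probability at 0.5 is the same as checking the sign of \( \theta^Tx \).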
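MLE (Maximum Likelihood Estimation)
Next, we derive the loss function from the perspective of maximum likelihood estimation.
In general, with MLE we want \( \arg\max_\theta p(D|\theta) \), i.e. the parameters under which the observed data is most probable.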
Assume:
$$P(y=1|x;\theta)=h_\theta(x)\tag{8}$$
$$P(y=0|x;\theta)=1-h_\theta(x)\tag{9}$$
Equations (8) and (9) can be combined into Equation (10):
$$P(y|x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y}\tag{10}$$
Equation (10) is the likelihood of a single sample. How do we express the likelihood of all samples? See Equation (11):
Likelihood:
$$\begin{equation}\begin{split} L(\theta)&=p(\vec{y}|X;\theta)\\ &=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)\\&=\prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\tag{11}\end{split}\end{equation}$$
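This is in fact our objective function, but to find \( \arg\max_\theta L(\theta) \) we need to simplify it further. The simplification is to take the log, i.e. \( \log L(\theta) \). The benefits are:
- It simplifies the equation by turning the product into a sum.
- It avoids the risk of numerical underflow during computation.
- log is a monotonically increasing function, so it does not change where the original function attains its maximum.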
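The underflow point is easy to reproduce. In the sketch below, 2000 made-up per-sample likelihoods of 0.5 are multiplied directly and then summed in log space.

```python
import numpy as np

# 2000 hypothetical per-sample likelihoods, each equal to 0.5
p = np.full(2000, 0.5)

likelihood = np.prod(p)             # 0.5**2000 is too small for float64
log_likelihood = np.sum(np.log(p))  # 2000 * log(0.5), a perfectly ordinary number
```

The direct product underflows to exactly 0.0, while the sum of logs stays well within floating-point range.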
Log Likelihood
$$\begin{equation}\begin{split} \ell(\theta)&=\log L(\theta)\\ &=\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{12}\end{split}\end{equation}$$
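Equation (12) is the log-likelihood function we are looking for. Putting a minus sign in front of it turns it into the loss function; the only difference is that the likelihood is maximized while the loss function is minimized.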
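A minimal sketch of Equation (12) and its negation, assuming a design matrix `X` with one sample per row (including the \( x_0=1 \) column) and labels in {0, 1}. The function names and the data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Equation (12): sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def loss(theta, X, y):
    """Negative log-likelihood: maximizing (12) is the same as minimizing this."""
    return -log_likelihood(theta, X, y)

# Hypothetical data: three samples, first column is x_0 = 1
X = np.array([[1.0, -2.0],
              [1.0, 0.5],
              [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0])

theta0 = np.zeros(2)
initial_loss = loss(theta0, X, y)  # h = 0.5 for every sample, so this is m * log(2)
```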
Gradient ascent for one sample:
To keep the derivative simple, the sum is left out of the calculation, i.e. we consider a single sample only.
$$\begin{equation}\begin{split} \frac{\partial \ell(\theta)}{\partial \theta_j}&=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)\frac{\partial g(\theta^Tx)}{\partial \theta_j}\\&=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial \theta^Tx}{\partial \theta_j}\\&=(y(1-g(\theta^Tx))-(1-y)g(\theta^Tx))x_j\\&=(y-h_\theta(x))x_j\tag{13}\end{split}\end{equation}$$
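One way to gain confidence in the result \( (y-h_\theta(x))x_j \) is to compare it with a finite-difference approximation of the single-sample log-likelihood. The sketch below does this for arbitrary, made-up values of \( \theta \), \( x \) and \( y \).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik_one(theta, x, y):
    """Log-likelihood of a single sample (one term of Equation (12))."""
    h = sigmoid(theta @ x)
    return y * np.log(h) + (1 - y) * np.log(1 - h)

def grad_one(theta, x, y):
    """Equation (13): (y - h_theta(x)) * x_j for every component j."""
    return (y - sigmoid(theta @ x)) * x

# Hypothetical values chosen only for the check
theta = np.array([0.1, -0.2, 0.3])
x = np.array([1.0, 2.0, -1.0])
y = 1.0

# Central finite difference along each coordinate direction
eps = 1e-6
numeric = np.array([
    (log_lik_one(theta + eps * e, x, y) - log_lik_one(theta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(theta))
])
analytic = grad_one(theta, x, y)
```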
So, the gradient ascent rule is:
$$\theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\tag{14}$$
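A sketch of this per-sample update applied repeatedly over a tiny dataset (stochastic gradient ascent). The learning rate, number of passes, data, and the name `sga_epoch` are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def sga_epoch(theta, X, y, alpha):
    """One pass over the data, applying the per-sample ascent update once per sample."""
    theta = theta.copy()
    for x_i, y_i in zip(X, y):
        theta += alpha * (y_i - sigmoid(theta @ x_i)) * x_i
    return theta

# Hypothetical data: first column is x_0 = 1, second is the single feature
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
before = log_likelihood(theta, X, y)
for _ in range(50):
    theta = sga_epoch(theta, X, y, alpha=0.1)
after = log_likelihood(theta, X, y)
```

The update adds the gradient rather than subtracting it because the log-likelihood is being maximized.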
Loss function
As mentioned above, negating Equation (12) gives our loss function:
$$\arg\min_\theta\;-\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{15}$$
And gradient descent rule for all samples:
$$\theta_j:=\theta_j-\alpha\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\tag{16}$$
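A vectorized sketch of batch gradient descent on the negative log-likelihood. The gradient of that loss with respect to \( \theta \) is \( X^T(h-y) \), i.e. \( \sum_i(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \) for component j, and the step subtracts it. The data, learning rate, and function names are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """Negative log-likelihood summed over all samples."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gd_step(theta, X, y, alpha):
    """One batch step: theta := theta - alpha * X^T (h - y)."""
    grad = X.T @ (sigmoid(X @ theta) - y)
    return theta - alpha * grad

# Hypothetical data: first column is x_0 = 1
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
losses = [loss(theta, X, y)]
for _ in range(100):
    theta = gd_step(theta, X, y, alpha=0.1)
    losses.append(loss(theta, X, y))
```

Note the sign: descending on the loss with \( (h-y) \) is the same move as ascending on the log-likelihood with \( (y-h) \).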
As we can see in the figure above, we penalize wrong predictions with an increasingly larger cost.
Gradient descent code
```python
import numpy as np


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), Equation (5)."""
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent(X, Y, nx, ny, m, num_iterations, alpha):
    """
    Gradient descent to train parameters.

    X has shape (nx, m) and Y has shape (1, m); ny is not used here.
    """
    W = np.zeros(shape=(nx, 1), dtype=np.float32)  # weights initialization
    b = 0.0  # bias initialization
    for i in range(num_iterations):
        Z = np.dot(W.T, X) + b  # shape: (1, m)
        A = sigmoid(Z)  # shape: (1, m)

        if i % 1000 == 0:
            # Two equivalent ways to compute the mean cross-entropy cost
            # (the negative log-likelihood divided by m).
            cost_1 = -1.0 / m * (np.dot(Y, np.log(A).T) + np.dot((1-Y), np.log(1-A).T))
            print('cost_1:{}'.format(np.squeeze(cost_1)))
            cost_2 = -1.0 / m * np.sum(np.multiply(Y, np.log(A)) + np.multiply(1-Y, np.log(1-A)))
            print('cost_2:{}'.format(cost_2))

        # computation graph
        dZ = A - Y  # The derivative of cost to A to Z. shape: (1, m)
        dW = np.dot(X, dZ.T)  # The derivative of cost to A to Z to W. shape: (nx, 1)
        W -= 1.0 / m * alpha * dW  # update W
        db = np.sum(dZ)  # The derivative of cost to A to Z to b
        b -= 1.0 / m * alpha * db  # update b
    return W, b
```