关于人工智能:使用python手写MetropolisHastings算法的贝叶斯线性回归

在学习贝叶斯计算的解马尔可夫链蒙特卡洛(MCMC)模仿时，最简略的办法是应用PyMC3，构建模型，调用Metropolis优化器。然而应用他人的包咱们并不真正了解产生了什么，所以本文通过手写Metropolis-Hastings来深刻的了解MCMC的过程，再次强调咱们本人实现该办法并不是并不是为了造轮子，而是为了更好的通过代码了解该概念。

贝叶斯线性回归蕴含了几十个概念和定义，这使得咱们的整个钻研成为一种折磨，并且真正产生的事件。在本文中，我将通过常见Metropolis-Hastings 算法构建一个马尔可夫链，并提供一个理论的应用案例。咱们将着重于推断简略线性回归模型的参数（然而这里说“简略”并不能代表它背地的原理简略）。咱们还能够用任何其余条件散布(泊松/伽马/负二项或其余)替换正态分布，这样能够通过MCMC实现简直雷同的GLM（只更改4或5行代码）。因为推断的样本来自参数的后验散布，咱们能够很天然地应用这些来构建参数的置信区间(在这种状况下，“可信区间”更正确)。

上面咱们将简要形容为什么应用MCMC办法，提供一个线性回归模型的MH算法的实现，并将以一个可视化的形式显示当算法寻找生成数据的参数集时，真正产生了什么。

数据筹备

设Y和X别离为模型的响应和输出。线性模型表明，给定输出的响应的条件散布是正态的。也就是:

对于适合的参数a(斜率)、b(偏差)和(噪声强度)。

咱们的工作是推断a, b和。

所以咱们首先要晓得一些模型须要遵循的“根本规定”。在这个例子中，a, b简直能够是任何数值，正的或负的，但必须是严格正的(因为素来没有据说过负标准差的正态分布，对吧?)除此之外，没有其余任何规定。兴许你会说：“咱们须要先理解这些是如何散布的”，然而后验散布的渐近正态性保障通知咱们，只有有足够的后验样本，这些样本无论如何都是正态分布(如果马尔可夫链达到其安稳散布)，所以散布不是咱们思考的必要因素。

当初让咱们为回归生成合成数据，这里应用参数a=3, b=20和=5。能够通过以下代码在python中实现:

import numpy as npimport matplotlib.pyplot as pltimport scipy.stats as sc# sample xnp.random.seed(2022)x = np.random.rand(100)*30# set parametersa = 3b = 20sigma = 5# obtain response and add noisey = a*x+bnoise = np.random.randn(100)*sigma# create a matrix containing the predictor in the first column# and the response in the seconddata = np.vstack((x,y)).T + noise.reshape(-1,1)# plot data plt.scatter(data[:,0], data[:,1])plt.xlabel("x")plt.ylabel("y")

合成数据如下图所示：

数据的筹备曾经实现了，下一节将波及定义 Metropolis Hastings 算法的函数和一组迭代次数的循环。

算法介绍

假如=[a,b，]是算法下面的参数向量， '是一组新参数的倡议，MH比拟参数( '和)的两个竞争假如之间的贝叶斯因子(似然和先验的乘积)，并通过条件倡议散布的倒数缩放该因子。而后将该因子与均匀分布的随机变量的值进行比拟。这给模型减少了随机性，使不可能的参数向量有可能被摸索，也可能被抛弃(很少)。

这听起来有点简单，让咱们从头一步一步对它进行代码的实现。

Proposal Distribution

首先，咱们定义了一个倡议的散布（Proposal Distribution）g(′|):这是前一个工夫步固定参数的散布。在咱们的例子中，a和b能够是正的也能够是负的，因而一个天然的抉择就是从一个以前一个迭代步骤为核心的多元正态分布中采样它们。咱们能够从伽马散布中取样，这些散布的定义咱们能够依据理论状况进行抉择，然而一个更好的办法(这里咱们将不波及)是从逆伽马散布抽样。

因而，咱们能够依照以下形式定义进行Proposal Distribution:

散布抽样为

参数 k 用于管制散布在其均值四周的“扩大”。是对gamma 额定调整。k越大，gamma 的散布越大（咱们应用的是 gamma 散布的形态速率公式因为 scipy 强制应用形态比例公式）。性能如下：

def proposal(prec_theta, search_width = 0.5):    # this function generates the proposal for the new theta    # we assume that the distribution of the random variables     # is normal for the first two and gamma for the third.    # conditional on the previous value of the accepted parameters (prec_theta)    out_theta = np.zeros(3)    out_theta[:2] = sc.multivariate_normal(mean=prec_theta[:2],cov=np.eye(2)*search_width**2).rvs(1)    #the last component is the noise    out_theta[2] = sc.gamma(a=prec_theta[2]*search_width*500, scale=1/(500*search_width)).rvs()    return out_theta

咱们可能会能够留神到函数proposal蕴含两个参数:prec_theta示意下面说的的参数向量，search_width示意寻找倡议的区域。寻找一组良好的参数会在摸索(在未摸索的区域中搜寻新参数集)和开发(在已找到良好参数集的区域中改良搜寻)之间进行衡量。所以应该十分小心地设置search_width。search_width值过大会使MCMC收敛到安稳散布。一个小的值可能会阻止算法在正当的工夫内找到最优(optima)(须要绘制更多的样本，更多的训练工夫)。

似然函数

似然函数其实就是线性函数，并且给定参数的响应的条件散布是正态的。换句话说，咱们将计算正态分布的可能性，其中均值是输出和系数a和b的乘积，噪声是。在这种状况下，咱们将应用对数似然而不是原始似然，这样能够进步稳定性。

def lhd(x,theta):    # x is the data matrix, first column for input and second column for output.    # theta is a vector containing the parameters for the evaluation    # remember theta[0] is a, theta[1] is b and theta[2] is sigma    xs = x[:,0]    ys = x[:,1]    lhd_out = sc.norm.logpdf(ys, loc=theta[0]*xs+theta[1], scale=theta[2])    # then we sum lhd_out (be careful here, we are summing instead of multiplying    # because we are dealing with the log-likelihood, instead of the raw likelihood).    lhd_out = np.sum(lhd_out)    return lhd_out

先验

咱们不须要在这方面花很多工夫。在先验散布方面，能够抉择任何喜爱的货色。这里须要做的就是确保可能找到推断的参数的区域具备非零先验，并且噪声参数是非负的。这里的先验以 log-pdf 的模式示意。

def prior(theta):    # evaluate the prior for the parameters on a multivariate gaussian.     prior_out = sc.multivariate_normal.logpdf(theta[:2],mean=np.array([0,0]), cov=np.eye(2)*100)    # this needs to be summed to the prior for the sigma, since I assumed independence.    prior_out += sc.gamma.logpdf(theta[2], a=1, scale=1)    return prior_out

Proposal Ratio

Proposal Ratio是在给定新参数向量的状况下察看到旧参数向量的可能性除以给定旧参数向量的状况下察看到新参数向量的概率。也就是Proposal Distribution提到的，g(|′)/g(′|)。这里将应用log-pdf，这样能够在概率中具备对立的尺度，并取得更好的数值稳定性。

def proposal_ratio(theta_old, theta_new, search_width=10):    # this is the proposal distribution ratio    # first, we calculate of the pdf of the proposal distribution at the old value of theta with respect to the new     # value of theta. And then we do the exact opposite.    prop_ratio_out = sc.multivariate_normal.logpdf(theta_old[:2],mean=theta_new[:2], cov=np.eye(2)*search_width**2)    prop_ratio_out += sc.gamma.logpdf(theta_old[2], a=theta_new[2]*walk*500, scale=1/(500*walk))    prop_ratio_out -= sc.multivariate_normal.logpdf(theta_new[:2],mean=theta_old[:2], cov=np.eye(2)*search_width**2)    prop_ratio_out -= sc.gamma.logpdf(theta_new[2], a=theta_old[2]*walk*500, scale=1/(500*walk))    return prop_ratio_out

整合

把所有信息整合起来。伪代码如下:

1)实例化参数向量的初始值

．.． N次，直到收敛

2)从倡议散布中找到一个新的参数向量

3)计算似然、先验pdf值和倡议似然比的倒数

4)将3中的所有数量相乘(或log求和)，并比拟这个比例(线性比例)

依据从均匀分布中得出的数字。

5)如果比值较大，则承受新的参数向量。

否则，新的参数向量将被回绝。

6)反复2

在Python代码如下:

np.random.seed(100)width = 0.2thetas = np.random.rand(3).reshape(1,-1)accepted = 0rejected = 0N = 200000for i in range(N):    # 1) provide a proposal for theta    theta_new = proposal(thetas[-1,:], search_width=width)        # 2) calculate the likelihood of this proposal and the likelihood    # for the old value of theta    log_lik_theta_new = lhd(data,theta_new)    log_lik_theta = lhd(data,thetas[-1,:])        # 3) evaluate the prior log-pdf at the current and at the old value of theta    theta_new_prior = prior(theta_new)    theta_prior = prior(thetas[-1,:])        # 4) finally, we need the proposal distribution ratio    prop_ratio = proposal_ratio(thetas[-1], theta_new, search_width=width)        # 5) assemble likelihood, priors and proposal distributions    likelihood_prior_proposal_ratio = log_lik_theta_new - log_lik_theta + \                             theta_new_prior - theta_prior + prop_ratio        # 6) throw a - possibly infinitely weigthed - coin. The move for Metropolis-Hastings is    # not deterministic. Here, we exponentiate the likelihood_prior_proposal ratio in order    # to obtain the probability in linear scale    if np.exp(likelihood_prior_proposal_ratio) > sc.uniform().rvs():        thetas = np.vstack((thetas,theta_new))        accepted += 1    else:        rejected += 1

下面的代码可能会要花费一些工夫来运行。批改N来取得更少的采样，或编辑宽度以放慢摸索。

后果展现

后果蕴含在下图中

失去的：a(红色), b(蓝色) and (紫色)

参数的直方图。

能够看到，样本的历史后果十分稳固。平均值和标准偏差是：

然而有一个问题，咱们只在这里提一下：这些样本高度相干，因而在预计可信区间时可能须要小心。这里的一种解决方案是通过只保留一小部分参数来细化历史记录(例如，只保留1 / 10已承受的提议，并抛弃其余的)。

传统的线性回归相比如何呢？

应用 statsmodels.api.OLS 执行完全相同的计算

import statsmodels.api as smimport pandas as pddf = pd.DataFrame(data)df.columns = ["a","y"]df["b"] = 1lm = sm.OLS(df["y"], df[["a","b"]]).fit()lm.summary()

后果如下：

a 和 b 的系数的平均值十分靠近 MCMC。标准误差也能够这样说，这样也进一步证实了咱们对 MCMC 实现是可行的。

请留神，这不是最好的解决方案，而只是一个解决方案。因为的确存在并举荐更好的先验和倡议散布的抉择。

迭代的可视化？

在 3D 中可视化相当的凌乱，所以这里只关注斜率 a 和偏差 b。

在可视化中，每 10 步显示一次 MH 的行为：

红点代表以后的倡议提案，红点四周的灰色区域示意与平均值相差 3 个标准差的倡议散布（正态）。该提案有一个对角协方差矩阵，这就是咱们失去一个圆而不是椭圆。

蓝色线代表被回绝的动作。

红线代表承受的动作

最下面浅蓝点示意从 statsmodels.api.OLS 取得的参数的平均值。

https://avoid.overfit.cn/post/fd35419dff344202af60885ce263e3cc

作者：Fortunato Nucera