On machine learning: Gaussian Naive Bayes classification explained, with a hand-written implementation

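Gaussian Naive Bayes (GNB) is a machine learning classification technique based on a probabilistic approach and the Gaussian distribution. Naive Bayes assumes that each parameter (also called a feature or predictor) contributes independently to predicting the output variable. The predictions from all parameters are combined into a final prediction, which gives the probability of the dependent variable belonging to each group; the sample is then assigned to the group (class) with the highest probability.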
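In symbols, and assuming the usual notation (a class $C_k$ and a feature vector $x = (x_1, \dots, x_n)$, neither of which appears in the original text), the independence assumption turns Bayes' rule into a product of per-feature terms:

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)
```

In the Gaussian variant, each $p(x_i \mid C_k)$ is a normal density with a class-specific mean and standard deviation.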

What is a Gaussian distribution?

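The Gaussian distribution, also called the normal distribution, is a statistical model that describes how continuous random variables found in nature are distributed. It is defined by its bell-shaped curve, and its two most important parameters are the mean (μ) and the standard deviation (σ). The mean is the center of the distribution, and the standard deviation is the "width" of the distribution around the mean.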

It is important to know that a normally distributed variable (X) is continuous over -∞ < X < +∞, and that the total area under the model curve is 1.
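A quick numerical check of that last property, as a minimal sketch: the mean, standard deviation, grid limits and step below are arbitrary choices, and `normal_pdf` is a helper name introduced here for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Evaluate the density on a fine grid spanning 8 standard deviations on each side.
mean, sd = 40.0, 20.0
x = np.linspace(mean - 8 * sd, mean + 8 * sd, 20001)
dx = x[1] - x[0]

# A simple Riemann sum approximates the area under the curve.
area = float(np.sum(normal_pdf(x, mean, sd)) * dx)
print(round(area, 4))
```

The tails beyond 8 standard deviations contribute almost nothing, which is why a finite grid is enough here.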

Gaussian Naive Bayes for multi-class classification

Import the necessary libraries:

from random import random
from random import randint
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_decision_regions

Now create a dataset whose predictor variables are normally distributed.

#Creating values for FeNO with 3 classes:
FeNO_0 = np.random.normal(20, 19, 200)
FeNO_1 = np.random.normal(40, 20, 200)
FeNO_2 = np.random.normal(60, 20, 200)

#Creating values for FEV1 with 3 classes:
FEV1_0 = np.random.normal(4.65, 1, 200)
FEV1_1 = np.random.normal(3.75, 1.2, 200)
FEV1_2 = np.random.normal(2.85, 1.2, 200)

#Creating values for Broncho Dilation with 3 classes:
BD_0 = np.random.normal(150,49, 200)
BD_1 = np.random.normal(201,50, 200)
BD_2 = np.random.normal(251, 50, 200)

#Creating labels variable with three classes:(2)disease (1)possible disease (0)no disease:
not_asthma = np.zeros((200,), dtype=int)
poss_asthma = np.ones((200,), dtype=int)
asthma = np.full((200,), 2, dtype=int)

#Concatenate classes into one variable:
FeNO = np.concatenate([FeNO_0, FeNO_1, FeNO_2])
FEV1 = np.concatenate([FEV1_0, FEV1_1, FEV1_2])
BD = np.concatenate([BD_0, BD_1, BD_2])
dx = np.concatenate([not_asthma, poss_asthma, asthma])

#Create DataFrame:
df = pd.DataFrame()

#Add variables to DataFrame:
df['FeNO'] = FeNO.tolist()
df['FEV1'] = FEV1.tolist()
df['BD'] = BD.tolist()
df['dx'] = dx.tolist()

#Check database:
df
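Because `np.random.normal` draws fresh numbers on every run, the figures and statistics later on will differ from run to run. One way to make them repeatable, sketched here under the assumption that a fixed seed is acceptable (the seed value 0 and the names `make_dataset` and `df_seeded` are not from the original), is to use a seeded NumPy generator:

```python
import numpy as np
import pandas as pd

def make_dataset(seed=0, n=200):
    """Build the three-class FeNO/FEV1/BD dataset with a fixed random seed."""
    rng = np.random.default_rng(seed)
    # (mean, sd) for classes 0, 1, 2, copied from the parameters used above.
    params = {
        'FeNO': [(20, 19), (40, 20), (60, 20)],
        'FEV1': [(4.65, 1), (3.75, 1.2), (2.85, 1.2)],
        'BD': [(150, 49), (201, 50), (251, 50)],
    }
    data = {
        name: np.concatenate([rng.normal(m, s, n) for m, s in specs])
        for name, specs in params.items()
    }
    # Labels: n zeros, then n ones, then n twos, matching the concatenation order.
    data['dx'] = np.repeat([0, 1, 2], n)
    return pd.DataFrame(data)

df_seeded = make_dataset()
print(df_seeded.shape)
print(df_seeded['dx'].value_counts().sort_index().tolist())
```

Calling `make_dataset()` twice with the same seed returns identical data, so any statistics derived from it can be quoted and reproduced.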

Our df has 600 rows and 4 columns. Now we can check the distribution of the variables visually:

fig, axs = plt.subplots(2, 3, figsize=(14, 7))

# fill= and histplot replace the deprecated shade= and distplot (seaborn >= 0.11)
sns.kdeplot(df['FEV1'], fill=True, color="b", ax=axs[0, 0])
sns.kdeplot(df['FeNO'], fill=True, color="b", ax=axs[0, 1])
sns.kdeplot(df['BD'], fill=True, color="b", ax=axs[0, 2])
sns.histplot(df["FEV1"], kde=True, stat="density", ax=axs[1, 0])
sns.histplot(df["FeNO"], kde=True, stat="density", ax=axs[1, 1])
sns.histplot(df["BD"], kde=True, stat="density", ax=axs[1, 2])

plt.show()

On visual inspection, the data appears close to a Gaussian distribution. Q-Q plots allow a closer check:

from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

#q-q plot:
fig, axs = pyplot.subplots(1, 3, figsize=(15, 5))
qqplot(df['FEV1'], line='s', ax=axs[0])
qqplot(df['FeNO'], line='s', ax=axs[1])
qqplot(df['BD'], line='s', ax=axs[2])
pyplot.show()
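A Q-Q plot on a pooled column mixes three classes with different means, while Gaussian Naive Bayes only assumes normality within each class. As a complement, a numerical test can be run per class. The sketch below uses `scipy.stats.shapiro` on stand-in data generated with the FEV1 class parameters from above (the seed and the variable names are assumptions); on the article's DataFrame one would pass `df.loc[df['dx'] == k, 'FEV1']` instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in FEV1 samples per class, using the (mean, sd) pairs from the dataset above.
fev1_by_class = {
    0: rng.normal(4.65, 1.0, 200),
    1: rng.normal(3.75, 1.2, 200),
    2: rng.normal(2.85, 1.2, 200),
}

results = {}
for label, values in fev1_by_class.items():
    # Shapiro-Wilk test: a small p-value is evidence against normality.
    statistic, p_value = stats.shapiro(values)
    results[label] = (statistic, p_value)
    print(f"class {label}: W={statistic:.3f}, p={p_value:.3f}")
```

A large p-value does not prove normality; it only means the test found no strong evidence against it at this sample size.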

Although not perfectly normal, the distributions are close. Next, look at the dataset and the relationships between the variables:

#Exploring dataset:
sns.pairplot(df, kind="scatter", hue="dx")
plt.show()
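Since Naive Bayes assumes the features are independent given the class, the within-class correlations are the ones worth checking. A minimal sketch on stand-in data for a single class (the seed and the names `group` and `within_class_corr` are assumptions; on the article's DataFrame, `df.groupby('dx').corr()` gives the same kind of table for every class):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Stand-in data for one class, with features drawn independently as in the article.
group = pd.DataFrame({
    'FeNO': rng.normal(20, 19, n),
    'FEV1': rng.normal(4.65, 1.0, n),
    'BD': rng.normal(150, 49, n),
})

# Pearson correlation matrix; off-diagonal entries near 0 support the assumption.
within_class_corr = group.corr()
print(within_class_corr.round(2))
```

With real clinical measurements the off-diagonal values would usually not be this small, and large within-class correlations are one reason Naive Bayes posteriors can be overconfident.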

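Per-class density plots and box plots show how each feature is distributed in the three groups and which features separate the classes best: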

# plotting the per-class distributions on the same figure
fig, axs = plt.subplots(2, 3, figsize=(14, 7))

# hue picks one color per class, so no single color= is passed
sns.kdeplot(x=df['FEV1'], hue=df['dx'], fill=True, ax=axs[0, 0])
sns.kdeplot(x=df['FeNO'], hue=df['dx'], fill=True, ax=axs[0, 1])
sns.kdeplot(x=df['BD'], hue=df['dx'], fill=True, ax=axs[0, 2])
sns.boxplot(x=df["dx"], y=df["FEV1"], palette = 'magma', ax=axs[1, 0])
sns.boxplot(x=df["dx"], y=df["FeNO"], palette = 'magma',ax=axs[1, 1])
sns.boxplot(x=df["dx"], y=df["BD"], palette = 'magma',ax=axs[1, 2])

plt.show()

Naive Bayes classification by hand

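Writing the code by hand is not about reinventing the wheel; it is a way to understand the algorithm better. Before doing Bayesian classification, we first need to understand the normal distribution.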

The formula of the normal distribution gives the probability density of an observation within a given group:
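For reference, the standard probability density function of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}
\exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right)
```

Strictly speaking this is a density rather than a probability: it can exceed 1 for small $\sigma$, and only its integral over an interval is a probability. Naive Bayes uses it as the likelihood term for each feature.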

We can create a function to compute this value:

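def normal_dist(x, mean, sd):
    # Normal density: 1 / (sd * sqrt(2*pi)) * exp(-0.5 * ((x - mean) / sd)**2)
    # The whole denominator sd * sqrt(2*pi) must be inside the parentheses.
    prob_density = (1/(sd*np.sqrt(2*np.pi))) * np.exp(-0.5*((x-mean)/sd)**2)
    return prob_density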
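One way to gain confidence in such a function is to compare it with an independent implementation. The sketch below redefines the density under the hypothetical name `normal_pdf` (so the block runs on its own) and checks it against `statistics.NormalDist.pdf` from the Python standard library; the test points reuse numbers from the worked example in this article.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, sd):
    """Normal density: exp(-0.5 * ((x - mean) / sd)**2) / (sd * sqrt(2*pi))."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# (x, mean, sd) triples to compare against the standard library.
cases = [(2.75, 4.70, 1.08), (27, 19.71, 19.29), (125, 152.59, 50.33)]
for x, mean, sd in cases:
    ours = normal_pdf(x, mean, sd)
    reference = NormalDist(mean, sd).pdf(x)
    print(f"{ours:.6f} {reference:.6f}")
    assert math.isclose(ours, reference, rel_tol=1e-9)
```

Operator precedence is the usual trap in this formula: an expression such as `1/sd*math.sqrt(2*math.pi)` evaluates as `(1/sd) * sqrt(2π)`, which is off by a constant factor of 2π.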

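Knowing the normal distribution formula, we can compute the likelihood of a sample under each of the three groups (classes). First, we need the mean and standard deviation of every predictor feature in every group: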

#Group 0:
group_0 = df[df['dx'] == 0]
print('Mean FEV1 group 0:', statistics.mean(group_0['FEV1']))
print('SD FEV1 group 0:', statistics.stdev(group_0['FEV1']))
print('Mean FeNO group 0:', statistics.mean(group_0['FeNO']))
print('SD FeNO group 0:', statistics.stdev(group_0['FeNO']))
print('Mean BD group 0:', statistics.mean(group_0['BD']))
print('SD BD group 0:', statistics.stdev(group_0['BD']))

#Group 1:
group_1 = df[df['dx'] == 1]
print('Mean FEV1 group 1:', statistics.mean(group_1['FEV1']))
print('SD FEV1 group 1:', statistics.stdev(group_1['FEV1']))
print('Mean FeNO group 1:', statistics.mean(group_1['FeNO']))
print('SD FeNO group 1:', statistics.stdev(group_1['FeNO']))
print('Mean BD group 1:', statistics.mean(group_1['BD']))
print('SD BD group 1:', statistics.stdev(group_1['BD']))

#Group 2:
group_2 = df[df['dx'] == 2]
print('Mean FEV1 group 2:', statistics.mean(group_2['FEV1']))
print('SD FEV1 group 2:', statistics.stdev(group_2['FEV1']))
print('Mean FeNO group 2:', statistics.mean(group_2['FeNO']))
print('SD FeNO group 2:', statistics.stdev(group_2['FeNO']))
print('Mean BD group 2:', statistics.mean(group_2['BD']))
print('SD BD group 2:', statistics.stdev(group_2['BD']))
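The same table of per-class means and standard deviations can be produced in one call with pandas. This is a sketch on stand-in data (the seed and the names `df_demo`, `summary` and `sd_check` are assumptions); pandas' `std` uses the sample standard deviation (`ddof=1`) by default, which matches `statistics.stdev`.

```python
import statistics
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

def draw(specs):
    """Concatenate n normal draws for each (mean, sd) pair, one pair per class."""
    return np.concatenate([rng.normal(m, s, n) for m, s in specs])

# Stand-in for the article's DataFrame, built from the same class parameters.
df_demo = pd.DataFrame({
    'FeNO': draw([(20, 19), (40, 20), (60, 20)]),
    'FEV1': draw([(4.65, 1), (3.75, 1.2), (2.85, 1.2)]),
    'BD': draw([(150, 49), (201, 50), (251, 50)]),
    'dx': np.repeat([0, 1, 2], n),
})

# One row per class, one (mean, std) column pair per feature.
summary = df_demo.groupby('dx').agg(['mean', 'std'])
print(summary.round(2))

# Spot check one cell against the statistics module.
sd_check = statistics.stdev(df_demo.loc[df_demo['dx'] == 0, 'FEV1'].tolist())
```

Individual values can then be looked up with, for example, `summary.loc[0, ('FEV1', 'std')]` instead of being copied from printed output.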

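Now test with an arbitrary sample: FEV1 = 2.75, FeNO = 27, BD = 125. The means and standard deviations passed to normal_dist below are the values printed by the previous step in one particular run; because the data is random, your own values will differ slightly.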

#Probability for:
#FEV1 = 2.75
#FeNO = 27
#BD = 125

#We have the same number of observations, so the general probability is: 0.33
Prob_geral = round(0.333, 3)

#Prob FEV1:
Prob_FEV1_0 = round(normal_dist(2.75, 4.70, 1.08), 10)
print('Prob FEV1 0:', Prob_FEV1_0)
Prob_FEV1_1 = round(normal_dist(2.75, 3.70, 1.13), 10)
print('Prob FEV1 1:', Prob_FEV1_1)
Prob_FEV1_2 = round(normal_dist(2.75, 3.01, 1.22), 10)
print('Prob FEV1 2:', Prob_FEV1_2)

#Prob FeNO:
Prob_FeNO_0 = round(normal_dist(27, 19.71, 19.29), 10)
print('Prob FeNO 0:', Prob_FeNO_0)
Prob_FeNO_1 = round(normal_dist(27, 42.34, 19.85), 10)
print('Prob FeNO 1:', Prob_FeNO_1)
Prob_FeNO_2 = round(normal_dist(27, 61.78, 21.39), 10)
print('Prob FeNO 2:', Prob_FeNO_2)

#Prob BD:
Prob_BD_0 = round(normal_dist(125, 152.59, 50.33), 10)
print('Prob BD 0:', Prob_BD_0)
Prob_BD_1 = round(normal_dist(125, 199.14, 50.81), 10)
print('Prob BD 1:', Prob_BD_1)
Prob_BD_2 = round(normal_dist(125, 256.13, 47.04), 10)
print('Prob BD 2:', Prob_BD_2)

#Compute probability:
Prob_group_0 = Prob_geral*Prob_FEV1_0*Prob_FeNO_0*Prob_BD_0
print('Prob group 0:', Prob_group_0)

Prob_group_1 = Prob_geral*Prob_FEV1_1*Prob_FeNO_1*Prob_BD_1
print('Prob group 1:', Prob_group_1)

Prob_group_2 = Prob_geral*Prob_FEV1_2*Prob_FeNO_2*Prob_BD_2
print('Prob group 2:', Prob_group_2)

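As the products show, this sample gets its highest score for group 1 (the second group, "possible disease"), narrowly ahead of group 0, so that is the class it is assigned to. This is how Naive Bayes works when computed by hand; for a mature algorithm like this one, scikit-learn offers a more efficient implementation.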
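The three products above are unnormalized scores (prior times likelihood), so they do not sum to 1. Dividing each by their sum gives posterior probabilities, and working with logarithms avoids underflow when many small densities are multiplied. A minimal sketch, reusing the means and standard deviations from the example (the names `params`, `sample` and `log_normal_pdf` are introduced here for illustration):

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log of the normal density, computed directly for numerical stability."""
    return -np.log(sd * np.sqrt(2 * np.pi)) - 0.5 * ((x - mean) / sd) ** 2

# Per-class (mean, sd) for each feature, copied from the worked example above.
params = {
    0: {'FEV1': (4.70, 1.08), 'FeNO': (19.71, 19.29), 'BD': (152.59, 50.33)},
    1: {'FEV1': (3.70, 1.13), 'FeNO': (42.34, 19.85), 'BD': (199.14, 50.81)},
    2: {'FEV1': (3.01, 1.22), 'FeNO': (61.78, 21.39), 'BD': (256.13, 47.04)},
}
sample = {'FEV1': 2.75, 'FeNO': 27, 'BD': 125}
log_prior = np.log(1 / 3)  # the three classes have equal size

# Log prior plus the sum of per-feature log likelihoods, for each class.
log_scores = np.array([
    log_prior + sum(log_normal_pdf(sample[f], *params[k][f]) for f in sample)
    for k in sorted(params)
])

# Subtract the maximum before exponentiating, then normalize to sum to 1.
shifted = log_scores - log_scores.max()
posteriors = np.exp(shifted) / np.exp(shifted).sum()
print(posteriors.round(3), int(posteriors.argmax()))
```

Because the prior is the same for every class here, it cancels in the normalization and does not affect which class wins.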

A classifier example with scikit-learn

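Scikit-learn's GaussianNB gives us a more efficient way to do this. Below is a complete classification example with GaussianNB. First create the X and y variables and split them into training and test sets: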

#Creating X and y:
X = df.drop('dx', axis=1)
y = df['dx']

#Data split into train and test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Before feeding the data to the model, standardize it with StandardScaler:

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
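What `StandardScaler` does can be written out in a few lines of NumPy: it learns the per-column mean and (population) standard deviation on the training set and applies the same shift and scale to the test set. A minimal sketch on made-up arrays (the data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up train/test matrices with three columns on different scales.
train = rng.normal([40, 3.7, 200], [20, 1.2, 50], size=(420, 3))
test = rng.normal([40, 3.7, 200], [20, 1.2, 50], size=(180, 3))

# "fit": the statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the convention StandardScaler uses

# "transform": the same shift and scale is applied to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting on the training set only, as the article's code does, keeps information from the test set out of the model. The scaled test set will therefore have means close to, but not exactly, 0.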

Now build and evaluate the model:

#Build the model:
classifier = GaussianNB()
classifier.fit(X_train, y_train)

#Evaluate the model:
print("training set score: %f" % classifier.score(X_train, y_train))
print("test set score: %f" % classifier.score(X_test, y_test))

Next, use a confusion matrix to examine the results:

# Predicting the Test set results
y_pred = classifier.predict(X_test)

#Confusion Matrix:
cm = confusion_matrix(y_test, y_pred)
print(cm)
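Per-class recall and precision can be read off a confusion matrix directly. In scikit-learn's convention, rows are true classes and columns are predicted classes. The matrix below is made up for illustration and is not the article's result:

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows: true class, columns: predicted class).
cm_demo = np.array([
    [50, 8, 2],
    [10, 35, 15],
    [1, 14, 45],
])

# Recall: correct predictions divided by the number of true samples per class.
recall = cm_demo.diagonal() / cm_demo.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class.
precision = cm_demo.diagonal() / cm_demo.sum(axis=0)
# Accuracy: all correct predictions divided by all samples.
accuracy = cm_demo.diagonal().sum() / cm_demo.sum()

print(recall.round(3), precision.round(3), round(float(accuracy), 3))
```

Applied to the real `cm`, a low recall in one row shows which true class the model misses most often, and the off-diagonal entries in that row show which class it is confused with.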

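The confusion matrix shows that the model is best at predicting class 0, while the error rates for classes 1 and 2 are high. To look into this, we build decision boundary plots from pairs of variables: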

df.to_csv('data.csv', index = False)
data = pd.read_csv('data.csv')
def plot_gaussian_nb(data, features):
    # Fit GaussianNB on a pair of features and plot its decision regions.
    x = data[features].values
    y = data['dx'].astype(int).values
    Gauss_nb = GaussianNB()
    Gauss_nb.fit(x, y)
    print(Gauss_nb.score(x, y))
    #Plot decision region:
    plot_decision_regions(x, y, clf=Gauss_nb, legend=1)
    #Adding axes annotations (the two feature names, not X_train/y_train):
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    plt.title('Gaussian Naive Bayes')
    plt.show()

plot_gaussian_nb(data, ['BD', 'FeNO'])
plot_gaussian_nb(data, ['BD', 'FEV1'])
plot_gaussian_nb(data, ['FEV1', 'FeNO'])

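The decision boundaries show why misclassifications happen: in the plots, many points fall outside the decision region of their own class. With real data we would need to analyze the specific causes, but since this is synthetic test data no further analysis is needed.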

https://www.overfit.cn/post/0457f85f2c184ff0864db5256654aef1
