Artificial Intelligence: Predicting Customer Churn with Cosine Similarity

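Cosine similarity is a way of measuring how similar two vectors are, and it is widely used in text classification and information retrieval. Specifically, given two vectors A and B, their cosine similarity can be computed with the following formula:

 cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B))

Here dot_product(A, B) is the dot product of A and B, and norm(A) and norm(B) are the norms of A and B respectively. The more similar A and B are, the closer their cosine similarity is to 1, and vice versa.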
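
To make the formula concrete, here is a minimal NumPy sketch. The helper name cosine_sim and the example vectors are illustrative and are not part of the original article.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1
orthogonal = cosine_sim([1, 0], [0, 1])            # perpendicular vectors, 0
opposite = cosine_sim([1, 2], [-1, -2])            # opposite direction, close to -1
print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters: scaling a vector by a positive constant leaves the result unchanged.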

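The demo dataset used here comes from DataCamp:

The dataset comes from an Iranian telecom company, and each row represents one customer over a one-year period. Besides the churn label, it contains information about customer activity, such as call failures and subscription length. What we ultimately want to predict is whether a customer churns, which is a binary classification problem.

The dataset looks like this: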

 import pandas as pd
 df = pd.read_csv("data/customer_churn.csv")

We first split the data into training and validation sets:

 from sklearn.model_selection import train_test_split
 
 # split the dataframe into 70% training and 30% testing sets
 train_df, test_df = train_test_split(df, test_size=0.3)
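
Churn labels are usually imbalanced, so a plain random split can leave the two sets with different churn rates. One option, which the original code does not use, is to pass stratify and random_state to train_test_split. The sketch below runs on a made-up toy_df with hypothetical column names, not on the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 20 customers, 4 of them churned.
toy_df = pd.DataFrame({
    "feature": range(20),
    "Churn": [1] * 4 + [0] * 16,
})

# stratify keeps the churn rate similar in both parts;
# random_state makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["Churn"], random_state=42
)
print(len(toy_train), len(toy_test))
print(toy_train["Churn"].mean(), toy_test["Churn"].mean())
```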

For comparison, we first build a baseline model with an SVM:

 from sklearn.svm import SVC
 from sklearn.metrics import classification_report, confusion_matrix
 
 # define the range of C and gamma values to try
 c_values = [0.1, 1, 10, 100]
 gamma_values = [0.1, 1, 10, 100]
 
 # initialize variables to store the best result
 best_accuracy = 0
 best_c = None
 best_gamma = None
 best_predictions = None
 
 # loop over the different combinations of C and gamma values
 for c in c_values:
     for gamma in gamma_values:
         # initialize the SVM classifier with RBF kernel, C, and gamma
         clf = SVC(kernel='rbf', C=c, gamma=gamma, random_state=42)
 
         # fit the classifier on the training set
         clf.fit(train_df.drop('Churn', axis=1), train_df['Churn'])
 
         # predict the target variable of the test set
         y_pred = clf.predict(test_df.drop('Churn', axis=1))
 
         # calculate accuracy and store the result if it's the best so far
         accuracy = clf.score(test_df.drop('Churn', axis=1), test_df['Churn'])
         if accuracy > best_accuracy:
             best_accuracy = accuracy
             best_c = c
             best_gamma = gamma
             best_predictions = y_pred
 
 # print the best result and the confusion matrix
 print(f"Best result: C={best_c}, gamma={best_gamma}, accuracy={best_accuracy:.2f}")
 print("Confusion matrix:")
 print(confusion_matrix(test_df['Churn'], best_predictions))

We can see that the support vector machine reaches 87% accuracy and predicts customer churn well.
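
Note that the loop above picks C and gamma by their score on the test set, so the reported accuracy is likely somewhat optimistic. A common alternative is to tune with cross-validation on the training set via GridSearchCV and score the test set only once. The sketch below shows that approach on synthetic data from make_classification, not on the article's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (an assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same grid as the loop above, evaluated with 3-fold cross-validation
# on the training set only.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# The test set is touched once, after the parameters are fixed.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 2))
```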

The following code uses the training dataset to compute the cosine similarity between classes.

 import pandas as pd
 from sklearn.metrics.pairwise import cosine_similarity
 
 # calculate the cosine similarity matrix between all rows of the dataframe
 cosine_sim = cosine_similarity(train_df.drop('Churn', axis=1))
 
 # create a dataframe from the cosine similarity matrix
 cosine_sim_df = pd.DataFrame(cosine_sim, index=train_df.index, columns=train_df.index)
 
 # create a copy of the train_df dataframe without the churn column
 train_df_no_churn = train_df.drop('Churn', axis=1)
 
 # calculate the mean cosine similarity for class 0 vs. class 0
 class0_cosine_sim_0 = cosine_sim_df.loc[train_df[train_df['Churn'] == 0].index, train_df[train_df['Churn'] == 0].index].mean().mean()
 
 # calculate the mean cosine similarity for class 0 vs. class 1
 class0_cosine_sim_1 = cosine_sim_df.loc[train_df[train_df['Churn'] == 0].index, train_df[train_df['Churn'] == 1].index].mean().mean()
 
 # calculate the mean cosine similarity for class 1 vs. class 1
 class1_cosine_sim_1 = cosine_sim_df.loc[train_df[train_df['Churn'] == 1].index, train_df[train_df['Churn'] == 1].index].mean().mean()
 
 # display the mean cosine similarities for each pair of classes
 print('Mean cosine similarity (class 0 vs. class 0):', class0_cosine_sim_0)
 print('Mean cosine similarity (class 0 vs. class 1):', class0_cosine_sim_1)
 print('Mean cosine similarity (class 1 vs. class 1):', class1_cosine_sim_1)
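
The same class-versus-class averages can be computed directly on the NumPy similarity matrix with boolean masks. This is a sketch on a tiny hand-made array where the expected values are easy to check by eye; X, y and mean_block are illustrative names. Like the code above, the same-class averages include each row's similarity with itself, which is always 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy features: class 0 points lie on the x-axis, class 1 points on the y-axis.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
y = np.array([0, 0, 1, 1])

sim = cosine_similarity(X)  # 4 x 4 matrix of pairwise similarities

def mean_block(sim, rows, cols):
    """Mean similarity between the rows selected by two boolean masks."""
    return sim[rows][:, cols].mean()

is0, is1 = (y == 0), (y == 1)
sim_00 = mean_block(sim, is0, is0)
sim_01 = mean_block(sim, is0, is1)
sim_11 = mean_block(sim, is1, is1)
print(sim_00, sim_01, sim_11)
```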

Below are their cosine similarities:

Then we build a DataFrame from them:

 import pandas as pd
 
 # create a dictionary with the mean similarity value for each comparison
 data = {'comparison': ['Class 0 vs. Class 0', 'Class 0 vs. Class 1', 'Class 1 vs. Class 1'],
     'similarity_mean': [class0_cosine_sim_0, class0_cosine_sim_1, class1_cosine_sim_1],
 }
 
 # create a Pandas DataFrame from the dictionary
 # (note: this reuses the name df, replacing the raw dataset loaded earlier)
 df = pd.DataFrame(data)
 
 # one row ('similarity_mean'), one column per comparison
 df = df.set_index('comparison').T
 
 # print the resulting DataFrame
 print(df)

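Next we apply this approach to the data. From the training set I create sample_churn_0, a random sample of 10 non-churned customers (Churn = 0) that serve as reference points.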

 # create a DataFrame containing a random sample of 10 points where Churn is 0
 sample_churn_0 = train_df[train_df['Churn'] == 0].sample(n=10)

Then we cross join it to test_df. This expands test_df to 10 times as many rows, because each test record gets the 10 sample records on its right side.

 import pandas as pd
 
 # assume test_df and sample_churn_0 are your dataframes
 
 # add a column to both dataframes with a common value to join on
 test_df['join_col'] = 1
 sample_churn_0['join_col'] = 1
 
 # perform the cross-join using merge()
 result_df = pd.merge(test_df, sample_churn_0, on='join_col')
 
 # drop the join_col column from the result dataframe
 result_df = result_df.drop('join_col', axis=1)
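
On pandas 1.2 or later, the helper join_col can be avoided because merge supports how='cross' directly (the article itself uses it further down). The sketch below uses two made-up frames; overlapping column names get the _x and _y suffixes, as in result_df.

```python
import pandas as pd

# Toy stand-ins for test_df and sample_churn_0 (hypothetical values).
toy_test = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "Churn": [0, 1, 0]})
toy_sample = pd.DataFrame({"feature": [10.0, 20.0], "Churn": [0, 0]})

# Every row of toy_test is paired with every row of toy_sample.
toy_cross = toy_test.merge(toy_sample, how="cross")
print(toy_cross.shape)
print(list(toy_cross.columns))
```

This also leaves the input frames unmodified, whereas the join_col approach adds a column to test_df in place.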

Now we compute the cosine similarity between the left and right sides of the cross-joined DataFrame.

 import pandas as pd
 from sklearn.metrics.pairwise import cosine_similarity
 
 # Extract the "_x" and "_y" columns from the result_df DataFrame, excluding the "Churn_x" and "Churn_y" columns
 df_x = result_df[[col for col in result_df.columns if col.endswith('_x') and not col.startswith('Churn_')]]
 df_y = result_df[[col for col in result_df.columns if col.endswith('_y') and not col.startswith('Churn_')]]
 
 # Calculate the cosine similarities between the two sets of vectors on each row
 cosine_sims = []
 for i in range(len(df_x)):
     cos_sim = cosine_similarity([df_x.iloc[i]], [df_y.iloc[i]])[0][0]
     cosine_sims.append(cos_sim)
 
 # Add the cosine similarity values as a new column in the result_df DataFrame
 result_df['cos_sim'] = cosine_sims
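
Calling cosine_similarity once per row works but can be slow on large frames. Assuming both sides hold the same numeric columns in the same order, the row-wise similarity can be computed in one vectorized step. rowwise_cosine is a hypothetical helper written for this sketch, not a scikit-learn function.

```python
import numpy as np

def rowwise_cosine(left, right):
    """Cosine similarity between left[i] and right[i] for every row i."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    dots = np.einsum("ij,ij->i", left, right)  # per-row dot products
    norms = np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1)
    return dots / norms

# Made-up rows: parallel, perpendicular, and a 3-4-5 pair.
left = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
right = np.array([[2.0, 0.0], [1.0, -1.0], [4.0, 3.0]])
sims = rowwise_cosine(left, right)
print(sims)
```

With the article's frames this would be called as rowwise_cosine(df_x.to_numpy(), df_y.to_numpy()). Rows containing an all-zero vector produce NaN in this sketch.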

Then we extract all the "_x" column names with the following code:

 x_col_names = [col for col in result_df.columns if col.endswith('_x')]

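This lets us group the rows and obtain the mean cosine similarity for each test_df record (each one currently appears 10 times). In grouped_df we then rename the group keys to x_col_names: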

 grouped_df = result_df.groupby(result_df.columns[:14].tolist()).agg({'cos_sim': 'mean'})
 
 grouped_df = grouped_df.rename_axis(x_col_names).reset_index()
 
 grouped_df.head()

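In the end, each test record carries its mean cosine similarity over these 10 samples.

The class-similarity DataFrame (df) that we computed in the earlier step is this:

We use these values as the reference for classification. First, we need to cross join it to grouped_df (the same as test_df, but with the mean cosine similarity):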

 cross_df = grouped_df.merge(df, how='cross')
 cross_df = cross_df.iloc[:, :-1]

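The result is as follows:

After dropping the last column, we are left with cos_sim plus the two reference columns, Class 0 vs. Class 0 and Class 0 vs. Class 1. Next we need the difference between each record's similarity and each class reference: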

 cross_df['diff_0'] = abs(cross_df['cos_sim'] - df['Class 0 vs. Class 0'].iloc[0])
 cross_df['diff_1'] = abs(cross_df['cos_sim'] - df['Class 0 vs. Class 1'].iloc[0])

The prediction code is as follows:

 # Add a new column 'predicted_churn'
 cross_df['predicted_churn'] = ''
 
 # Loop through each row and check the minimum difference
 for idx, row in cross_df.iterrows():
     if row['diff_0'] < row['diff_1']:
         cross_df.at[idx, 'predicted_churn'] = 0
     else:
         cross_df.at[idx, 'predicted_churn'] = 1
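
The row loop can be replaced by a single np.where call that applies the same rule: predict 0 when the similarity is closer to the Class 0 vs. Class 0 reference, otherwise 1 (ties go to 1, as in the else branch). Here is a sketch on made-up differences:

```python
import numpy as np
import pandas as pd

# Hypothetical distances to the two class references.
toy = pd.DataFrame({
    "diff_0": [0.01, 0.20, 0.05],
    "diff_1": [0.10, 0.02, 0.05],
})

# 0 if closer to the class-0 reference, else 1; yields an integer column.
toy["predicted_churn"] = np.where(toy["diff_0"] < toy["diff_1"], 0, 1)
print(toy)
```

Besides being faster, this gives predicted_churn an integer dtype instead of the object dtype that results from initialising the column with empty strings.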

Finally, let's look at the results:

 grouped_df__2 = cross_df.groupby(['predicted_churn', 'Churn_x']).size().reset_index(name='count')
 grouped_df__2['percentage'] = grouped_df__2['count'] / grouped_df__2['count'].sum() * 100
 
 grouped_df__2.head()
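
The overall accuracy is the share of rows where predicted_churn equals Churn_x, and pd.crosstab lays the same counts out as a confusion matrix. Here is a sketch on made-up labels, not the article's results:

```python
import pandas as pd

# Hypothetical actual and predicted labels.
toy = pd.DataFrame({
    "Churn_x": [0, 0, 0, 1, 1, 0],
    "predicted_churn": [0, 0, 1, 1, 0, 0],
})

# Fraction of rows where the prediction matches the actual label.
accuracy = (toy["predicted_churn"] == toy["Churn_x"]).mean()

# Rows are actual classes, columns are predicted classes.
table = pd.crosstab(toy["Churn_x"], toy["predicted_churn"])
print(accuracy)
print(table)
```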

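As we can see, the model's accuracy is 84.25%. However, its confusion matrix shows that it does better than the SVM on some predictions, which suggests that it can cope with class imbalance to some extent.

Cosine similarity by itself cannot directly solve class imbalance, because it is only a way of measuring similarity, not a classifier. However, cosine similarity can serve as a feature representation to improve classification performance on imbalanced datasets. This article is only an example, and there is still room for improvement.

The dataset for this article is here: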

https://avoid.overfit.cn/post/5cd4d22b523c418cb5d716e942a7ed46

If you are interested, you can try it yourself.

Author: Ashutosh Malgaonkar
