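Cosine similarity is a way to measure how similar two vectors are, and it is widely used in text classification and information retrieval. Given two vectors A and B, their cosine similarity is computed as:
cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B))
Here dot_product(A, B) is the dot product of A and B, and norm(A) and norm(B) are the norms of A and B respectively. The more similar A and B are, the closer their cosine similarity is to 1; the less similar they are, the further it falls below 1.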
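To make the formula concrete, here is a minimal NumPy sketch. The helper name `cosine_similarity_1d` and the example vectors are illustrative assumptions, not part of the original article.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (norm(a) * norm(b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The formula divides by the norms, so a zero vector has no defined value.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Hypothetical example vectors.
same_direction = cosine_similarity_1d([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity_1d([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine_similarity_1d([1, 2], [-1, -2])            # opposite directions
```

Parallel vectors score close to 1 regardless of their length, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1.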
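Dataset
The demo dataset used here comes from DataCamp:
The data comes from an Iranian telecom company, and each row represents one customer over a one-year period. Besides the churn label, it contains information about customer activity, such as call failures and subscription length. The goal is to predict whether a customer churns, which makes this a binary classification problem.
The dataset is loaded as follows: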
import pandas as pd

df = pd.read_csv("data/customer_churn.csv")
First we split the data into a training set and a test set:
from sklearn.model_selection import train_test_split

# split the dataframe into 70% training and 30% testing sets
train_df, test_df = train_test_split(df, test_size=0.3)
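Because the split above is random and unstratified, the share of churners in each part can vary from run to run. One possible variation, sketched here on a toy frame, passes `stratify` and `random_state` to `train_test_split`. Both arguments and the toy data are assumptions added for illustration; the article itself does not use them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the churn data (hypothetical values): 20% positives.
toy_df = pd.DataFrame({
    "feature": range(100),
    "Churn": [1] * 20 + [0] * 80,
})

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["Churn"], random_state=42
)

print(toy_train["Churn"].mean(), toy_test["Churn"].mean())
```

With a stratified split, both parts keep the 20% churn rate of the toy frame, which makes comparisons between models less sensitive to the luck of the draw.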
SVM
For comparison, we first train an SVM as a baseline model.
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# define the range of C and gamma values to try
c_values = [0.1, 1, 10, 100]
gamma_values = [0.1, 1, 10, 100]

# initialize variables to store the best result
best_accuracy = 0
best_c = None
best_gamma = None
best_predictions = None

# loop over the different combinations of C and gamma values
for c in c_values:
    for gamma in gamma_values:
        # initialize the SVM classifier with RBF kernel, C, and gamma
        clf = SVC(kernel='rbf', C=c, gamma=gamma, random_state=42)

        # fit the classifier on the training set
        clf.fit(train_df.drop('Churn', axis=1), train_df['Churn'])

        # predict the target variable of the test set
        y_pred = clf.predict(test_df.drop('Churn', axis=1))

        # calculate accuracy and store the result if it's the best so far
        accuracy = clf.score(test_df.drop('Churn', axis=1), test_df['Churn'])
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_c = c
            best_gamma = gamma
            best_predictions = y_pred

# print the best result and the confusion matrix
print(f"Best result: C={best_c}, gamma={best_gamma}, accuracy={best_accuracy:.2f}")
print("Confusion matrix:")
print(confusion_matrix(test_df['Churn'], best_predictions))
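As you can see, the SVM reached 87% accuracy and predicted customer churn well.
Cosine similarity algorithm
This code uses the training set to compute the cosine similarity between classes.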
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# calculate the cosine similarity matrix between all rows of the dataframe
cosine_sim = cosine_similarity(train_df.drop('Churn', axis=1))

# create a dataframe from the cosine similarity matrix
cosine_sim_df = pd.DataFrame(cosine_sim, index=train_df.index, columns=train_df.index)

# create a copy of the train_df dataframe without the churn column
train_df_no_churn = train_df.drop('Churn', axis=1)

# calculate the mean cosine similarity for class 0 vs. class 0
class0_cosine_sim_0 = cosine_sim_df.loc[train_df[train_df['Churn'] == 0].index, train_df[train_df['Churn'] == 0].index].mean().mean()

# calculate the mean cosine similarity for class 0 vs. class 1
class0_cosine_sim_1 = cosine_sim_df.loc[train_df[train_df['Churn'] == 0].index, train_df[train_df['Churn'] == 1].index].mean().mean()

# calculate the mean cosine similarity for class 1 vs. class 1
class1_cosine_sim_1 = cosine_sim_df.loc[train_df[train_df['Churn'] == 1].index, train_df[train_df['Churn'] == 1].index].mean().mean()

# display the mean cosine similarities for each pair of classes
print('Mean cosine similarity (class 0 vs. class 0):', class0_cosine_sim_0)
print('Mean cosine similarity (class 0 vs. class 1):', class0_cosine_sim_1)
print('Mean cosine similarity (class 1 vs. class 1):', class1_cosine_sim_1)
The resulting cosine similarities are shown below:
Then we put them into a DataFrame:
import pandas as pd

# create a dictionary with the mean similarity for each comparison
data = {
    'comparison': ['Class 0 vs. Class 0', 'Class 0 vs. Class 1', 'Class 1 vs. Class 1'],
    'similarity_mean': [class0_cosine_sim_0, class0_cosine_sim_1, class1_cosine_sim_1],
}

# create a Pandas DataFrame from the dictionary
# (note: this reuses the name df, replacing the raw dataset loaded earlier)
df = pd.DataFrame(data)
df = df.set_index('comparison').T

# print the resulting DataFrame
print(df)
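The class-level averages computed above are block means of the pairwise similarity matrix. One detail is that each within-class block includes every row's similarity with itself, which is always 1 and so pulls the mean upward. The sketch below shows the effect on a toy matrix. The helper `block_mean`, its `drop_diagonal` option, and the toy data are assumptions for illustration only, not the article's method.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy features and labels (hypothetical): two rows per class.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([0, 0, 1, 1])

sim = cosine_similarity(X)  # 4 x 4 matrix of pairwise similarities

def block_mean(sim, rows, cols, drop_diagonal=False):
    """Mean similarity between the rows selected by `rows` and by `cols`.

    drop_diagonal only makes sense when rows and cols select the same class.
    """
    block = sim[np.ix_(rows, cols)]
    if drop_diagonal:
        off_diag = ~np.eye(block.shape[0], dtype=bool)
        return block[off_diag].mean()
    return block.mean()

within_0 = block_mean(sim, y == 0, y == 0)                       # includes self-pairs
within_0_no_self = block_mean(sim, y == 0, y == 0, drop_diagonal=True)
between = block_mean(sim, y == 0, y == 1)

print(within_0, within_0_no_self, between)
```

On real data with thousands of rows the diagonal contributes much less than in this four-row toy, so whether to exclude it is a judgment call.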
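Next we apply this approach to make predictions. From the training set I draw sample_churn_0, a random sample of 10 rows with Churn equal to 0, which serve as reference points for measuring similarity.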
# create a DataFrame containing a random sample of 10 points where Churn is 0
sample_churn_0 = train_df[train_df['Churn'] == 0].sample(n=10)
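Then we cross-join it to test_df. This expands test_df to 10 times as many rows, because each test record now has the 10 sample records attached on its right-hand side.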
import pandas as pd

# assume test_df and sample_churn_0 are your dataframes

# perform the cross-join by merging on a temporary constant column;
# assign() adds the column to copies, so test_df and sample_churn_0 stay unchanged
result_df = pd.merge(
    test_df.assign(join_col=1),
    sample_churn_0.assign(join_col=1),
    on='join_col',
)

# drop the join_col column from the result dataframe
result_df = result_df.drop('join_col', axis=1)
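For readers on pandas 1.2 or later, the same cross join can be written without a helper column by using `how="cross"`, which the article itself uses in a later step. The following is a small sketch on toy frames; the frame names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for test_df and sample_churn_0.
left = pd.DataFrame({"a": [1, 2, 3], "Churn": [0, 1, 0]})
right = pd.DataFrame({"a": [10, 20], "Churn": [0, 0]})

# Every left row is paired with every right row; columns that exist in
# both frames get the default "_x" (left) and "_y" (right) suffixes.
crossed = left.merge(right, how="cross")

print(crossed.columns.tolist())
print(len(crossed))
```

The result has 3 x 2 rows, and the `_x`/`_y` suffixes are what the later steps rely on to tell the test features apart from the reference features.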
Now we compute the cosine similarity between the left and right sides of the cross-joined DataFrame.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Extract the "_x" and "_y" columns from the result_df DataFrame,
# excluding the "Churn_x" and "Churn_y" columns
df_x = result_df[[col for col in result_df.columns if col.endswith('_x') and not col.startswith('Churn_')]]
df_y = result_df[[col for col in result_df.columns if col.endswith('_y') and not col.startswith('Churn_')]]

# Calculate the cosine similarities between the two sets of vectors on each row
cosine_sims = []
for i in range(len(df_x)):
    cos_sim = cosine_similarity([df_x.iloc[i]], [df_y.iloc[i]])[0][0]
    cosine_sims.append(cos_sim)

# Add the cosine similarity values as a new column in the result_df DataFrame
result_df['cos_sim'] = cosine_sims
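The row-by-row loop above calls `cosine_similarity` once per pair, which can be slow on a large cross join. A vectorized alternative, sketched below with plain NumPy, computes all row-wise similarities in one pass. The function name `rowwise_cosine` and the example arrays are assumptions for illustration, and unlike the loop this sketch makes no special provision for all-zero rows (they would produce NaN).

```python
import numpy as np

def rowwise_cosine(left, right):
    """Cosine similarity between left[i] and right[i] for every row i."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Row-wise dot products: sum over columns of the element-wise product.
    dots = np.einsum("ij,ij->i", left, right)
    norms = np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1)
    return dots / norms

# Hypothetical example rows.
left_rows = [[1, 0], [1, 1], [3, 4]]
right_rows = [[2, 0], [0, 1], [-3, -4]]
row_sims = rowwise_cosine(left_rows, right_rows)
print(row_sims)
```

In the article's setting this would be called as `rowwise_cosine(df_x.to_numpy(), df_y.to_numpy())`, assuming both frames hold only numeric columns in the same order.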
Then we collect all the "_x" column names with the following code:
x_col_names = [col for col in result_df.columns if col.endswith('_x')]
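With these we can group the rows and get the mean cosine similarity for each test_df record (each record currently appears 10 times). In grouped_df, the grouping levels are then named after x_col_names: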
grouped_df = result_df.groupby(result_df.columns[:14].tolist()).agg({'cos_sim': 'mean'})
grouped_df = grouped_df.rename_axis(x_col_names).reset_index()
grouped_df.head()
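Grouping by the feature columns treats test rows with identical values as a single record. If that is not wanted, one option is to carry a row identifier through the cross join and group on it instead. The sketch below illustrates the difference on toy data; the `row_id` column, the frames, and the similarity values are all hypothetical.

```python
import pandas as pd

# Toy test set (hypothetical): rows 0 and 1 are exact duplicates.
test = pd.DataFrame({"a": [1, 1, 2], "Churn": [0, 0, 0]})
refs = pd.DataFrame({"a": [10, 20]})

# Keep the original row position as an explicit column before the cross join.
pairs = test.rename_axis("row_id").reset_index().merge(refs, how="cross")

# Made-up similarity values, one per (test row, reference row) pair.
pairs["cos_sim"] = [0.9, 0.7, 0.8, 0.6, 0.5, 0.3]

by_row = pairs.groupby("row_id")["cos_sim"].mean()               # one value per test row
by_features = pairs.groupby(["a_x", "Churn"])["cos_sim"].mean()  # duplicates collapse

print(len(by_row), len(by_features))
```

Grouping on `row_id` keeps one result per original test row, while grouping on the features merges the two duplicate rows into one group.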
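This gives, for each test record, the mean cosine similarity over the 10 samples.
The class-similarity DataFrame computed in the earlier step is this:
We use these values as the reference for classification. First, we cross-join it to grouped_df (which is the same as test_df, but with the mean cosine similarity added):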
cross_df = grouped_df.merge(df, how='cross')
# drop the last column ('Class 1 vs. Class 1'), which is not used below
cross_df = cross_df.iloc[:, :-1]
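The result is as follows:
Each row now carries its cos_sim value alongside the two reference columns, Class 0 vs. Class 0 and Class 0 vs. Class 1. Next we compute how far each row's similarity is from each reference value: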
cross_df['diff_0'] = abs(cross_df['cos_sim'] - df['Class 0 vs. Class 0'].iloc[0])
cross_df['diff_1'] = abs(cross_df['cos_sim'] - df['Class 0 vs. Class 1'].iloc[0])
The prediction code is as follows:
# Add a new column 'predicted_churn'
cross_df['predicted_churn'] = ''

# Loop through each row and check the minimum difference
for idx, row in cross_df.iterrows():
    if row['diff_0'] < row['diff_1']:
        cross_df.at[idx, 'predicted_churn'] = 0
    else:
        cross_df.at[idx, 'predicted_churn'] = 1
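The decision rule above assigns each record to whichever reference value its mean similarity is closer to. The same rule can be expressed without a loop using `np.where`, as in this small sketch. The reference values `ref_same` and `ref_other` and the toy similarities are made-up numbers for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical mean similarities for three test records.
toy = pd.DataFrame({"cos_sim": [0.95, 0.70, 0.78]})

# Hypothetical reference values standing in for
# 'Class 0 vs. Class 0' and 'Class 0 vs. Class 1'.
ref_same, ref_other = 0.90, 0.70

toy["diff_0"] = (toy["cos_sim"] - ref_same).abs()
toy["diff_1"] = (toy["cos_sim"] - ref_other).abs()

# Predict 0 when the record sits closer to the class-0 reference, else 1.
toy["predicted_churn"] = np.where(toy["diff_0"] < toy["diff_1"], 0, 1)

print(toy)
```

This produces an integer column directly, rather than filling an empty-string column row by row.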
Finally, let's look at the results:
grouped_df__2 = cross_df.groupby(['predicted_churn', 'Churn_x']).size().reset_index(name='count')
grouped_df__2['percentage'] = grouped_df__2['count'] / grouped_df__2['count'].sum() * 100
grouped_df__2.head()
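As shown, the model's accuracy is 84.25%. Its confusion matrix, however, shows that it does better than the SVM on some predictions, which suggests that it can handle class imbalance to some extent.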
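On imbalanced data, overall accuracy can hide how well each class is predicted, so it helps to read the per-class numbers from the confusion matrix as well. A minimal sketch with `pd.crosstab` follows; the label vectors are made up for illustration and are not the article's results.

```python
import numpy as np
import pandas as pd

# Made-up labels: 6 non-churners and 4 churners.
actual = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="actual")
predicted = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(actual, predicted)

# Overall accuracy: share of records where the prediction matches.
accuracy = (actual == predicted).mean()

# Per-class recall: correct predictions divided by the actual class size.
recall_per_class = pd.Series(
    np.diag(confusion.to_numpy()) / confusion.sum(axis=1).to_numpy(),
    index=confusion.index,
)

print(confusion)
print(accuracy)
print(recall_per_class)
```

Comparing the recall of the churn class between two models is a more direct way to check a claim about class imbalance than comparing overall accuracy alone.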
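Summary
Cosine similarity cannot solve class imbalance on its own, because it is only a way to compute similarity, not a classifier. It can, however, serve as a feature representation that helps improve classification performance on imbalanced datasets. This article is only an example, and there is room for improvement.
The dataset used in this article is available here: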
https://avoid.overfit.cn/post/5cd4d22b523c418cb5d716e942a7ed46
If you are interested, try it out yourself.
Author: Ashutosh Malgaonkar