Machine Learning: Five Data Similarity Metrics for Data Analysis

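Similarity measures are an important tool in many data analysis and machine learning tasks, because they let us compare different pieces of data and quantify how alike they are. Many metrics are available, each with its own strengths and weaknesses, and each suited to different data types and tasks.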

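This article looks at some of the most common similarity metrics and compares their advantages and disadvantages. By understanding the characteristics and limitations of each one, we can choose the metric that best fits our specific needs and make sure our results are accurate and relevant.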

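Euclidean Distance

Euclidean distance is the straight-line distance between two points in n-dimensional space. It is commonly used for continuous numerical data and is easy to understand and implement. However, it can be sensitive to outliers, and it does not take the relative importance of different features into account.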

from scipy.spatial import distance

# Calculate Euclidean distance between two points
point1 = [1, 2, 3]
point2 = [4, 5, 6]

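# Use the euclidean function from scipy's distance module to calculate the Euclidean distance
euclidean_distance = distance.euclidean(point1, point2)

# Print the result
print("Euclidean Distance between the given two points:" + \
      str(euclidean_distance))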
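
The caveat about feature importance is easier to see with numbers. The following is a minimal sketch in plain Python, not the SciPy implementation: the helper `euclidean_from_definition` and the two records (age in years, income in dollars) are made up for illustration.

```python
import math


def euclidean_from_definition(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square each per-dimension difference, sum them, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Same points as in the SciPy example: sqrt(9 + 9 + 9) = sqrt(27)
print(euclidean_from_definition([1, 2, 3], [4, 5, 6]))

# Hypothetical records: [age in years, income in dollars]
person_a = [25, 50_000]
person_b = [60, 50_500]

# The 500-dollar income gap swamps the 35-year age gap,
# simply because income is measured on a larger scale
print(euclidean_from_definition(person_a, person_b))
```

In practice this is one reason features are often rescaled (for example standardized) before a Euclidean distance is computed on them.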

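Manhattan Distance

Manhattan distance is computed by taking the absolute difference between the two points' coordinates in each dimension and summing those differences. It is less sensitive to outliers than Euclidean distance, but in some cases it may not accurately reflect the actual distance between the points.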

from scipy.spatial import distance

# Calculate Manhattan distance between two points
point1 = [1, 2, 3]
point2 = [4, 5, 6]

# Use the cityblock function from scipy's distance module to calculate the Manhattan distance
manhattan_distance = distance.cityblock(point1, point2)

# Print the result
print("Manhattan Distance between the given two points:" + \
      str(manhattan_distance))
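
To illustrate the difference in outlier sensitivity, here is a small sketch in plain Python. The helper functions and the points `spread` and `spike` are illustrative assumptions: both points have the same total coordinate difference from the origin, but `spike` packs it into a single dimension, the way an outlying value would.

```python
import math


def manhattan(p, q):
    # Sum of absolute per-dimension differences
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    # Square root of the sum of squared per-dimension differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


origin = [0, 0, 0, 0]
spread = [3, 3, 3, 3]   # moderate difference in every dimension
spike = [12, 0, 0, 0]   # same total difference, all in one dimension

print(manhattan(origin, spread), manhattan(origin, spike))  # 12 12
print(euclidean(origin, spread), euclidean(origin, spike))  # 6.0 12.0
```

Manhattan distance treats the two cases identically, while Euclidean distance doubles for the point with one extreme coordinate, because squaring gives large differences more weight.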

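Cosine Similarity

Cosine similarity measures how similar two vectors are by looking at the angle between them. It is commonly used for text data and is robust to changes in vector magnitude. However, it does not take the relative importance of different features into account.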
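
The robustness to vector magnitude can be checked directly from the definition (dot product divided by the product of the norms). This is a plain-Python sketch rather than the scikit-learn implementation; `cosine_from_definition` and the three toy term-count vectors are assumptions made for the example.

```python
import math


def cosine_from_definition(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


short_doc = [1, 2, 0]
long_doc = [10, 20, 0]  # same direction, ten times the magnitude
other_doc = [0, 0, 5]   # shares no non-zero dimension with short_doc

# Close to 1.0 (up to floating-point rounding): only direction matters
print(cosine_from_definition(short_doc, long_doc))

# 0.0: the vectors are orthogonal
print(cosine_from_definition(short_doc, other_doc))
```

This is why a short document and a much longer one with the same word proportions can still score as near-identical.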

from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between two vectors
vector1 = [1, 2, 3]
vector2 = [4, 5, 6]

# Use the cosine_similarity function from scikit-learn to calculate the similarity
cosine_sim = cosine_similarity([vector1], [vector2])[0][0]

# Print the result
print("Cosine Similarity between the given two vectors:" + \
      str(cosine_sim))

Jaccard Similarity

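Jaccard similarity measures how similar two sets are by dividing the size of their intersection by the size of their union. It is commonly used for categorical data and is robust to changes in set size. However, it ignores both the order of the elements and how often each one occurs.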

def jaccard_similarity(list1, list2):
    """
    Calculates the Jaccard similarity between two lists.
    
    Parameters:
    list1 (list): The first list to compare.
    list2 (list): The second list to compare.
    
    Returns:
    float: The Jaccard similarity between the two lists.
    """
    # Convert the lists to sets; duplicates and ordering are discarded
    s1 = set(list1)
    s2 = set(list2)
    union = s1 | s2

    # Two empty inputs have an empty union; treat them as identical
    # (a convention chosen here to avoid dividing by zero)
    if not union:
        return 1.0

    # Size of the intersection divided by the size of the union
    return len(s1 & s2) / len(union)

# Calculate Jaccard similarity between two sets
set1 = [1, 2, 3]
set2 = [2, 3, 4]
jaccard_sim = jaccard_similarity(set1, set2)

# Print the result
print("Jaccard Similarity between the given two sets:" + \
      str(jaccard_sim))
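
The point about order and frequency can be demonstrated with a few toy inputs. The sketch below defines its own small helper, `jaccard`, so that it runs on its own; the helper and the sample lists are illustrative only, and returning 1.0 for two empty inputs is an assumption, not a universal convention.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, compared as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Reordering the elements changes nothing
print(jaccard(["a", "b", "c"], ["c", "b", "a"]))            # 1.0

# Neither does repeating them: both lists reduce to {"a", "b"}
print(jaccard(["a", "a", "a", "b"], ["a", "b", "b", "b"]))  # 1.0

# Two shared elements out of four distinct ones
print(jaccard([1, 2, 3], [2, 3, 4]))                        # 0.5
```

If element counts matter for the task, a count-aware measure is usually a better fit than plain Jaccard similarity.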

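Pearson Correlation Coefficient

The Pearson correlation coefficient measures the strength of the linear relationship between two variables, on a scale from -1 to 1. It is commonly used for continuous numerical data, and because it standardizes each variable, it is unaffected by differences in their scale or offset. However, it may not accurately reflect non-linear relationships.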

import numpy as np

# Calculate Pearson correlation coefficient between two variables
x = [1, 2, 3, 4]
y = [2, 3, 4, 5]

# Use NumPy's corrcoef function to compute the 2x2 correlation matrix;
# entry [0][1] is the Pearson coefficient of x and y (corrcoef does not return a p-value)
pearson_corr = np.corrcoef(x, y)[0][1]

# Print the result
print("Pearson Correlation between the given two variables:" + \
      str(pearson_corr))
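
The limitation with non-linear relationships shows up even on a tiny data set. The following is a plain-Python sketch written from the textbook formula, not NumPy's implementation; the helper `pearson` and the sample series are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance and variances, without the common 1/n factor (it cancels)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # exact linear relationship
quadratic = [v ** 2 for v in x]   # exact, but non-linear, relationship

print(pearson(x, linear))     # 1.0
print(pearson(x, quadratic))  # 0.0
```

Here `quadratic` is completely determined by `x`, yet the coefficient is zero, because the relationship is symmetric around zero rather than linear.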


This article was published with mdnice's multi-platform publishing tool.
