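<article class="article fmt article-content"><p>Outlier detection is a key task in many fields. PyOD, short for Python Outlier Detection, simplifies the process of identifying outliers in multivariate datasets. In this article we introduce the PyOD package and walk through concrete code examples.</p>
<h2>Introduction to PyOD</h2>
<p>PyOD provides a wide collection of outlier detection algorithms that cover both supervised and unsupervised scenarios. Whether you are working with labeled or unlabeled data, PyOD offers a range of techniques to meet specific needs. One of its standout features is a user-friendly API that is easy to use for beginners and experienced practitioners alike.</p>
<h2>Example 1: kNN</h2>
<p>We start with a simple example that uses the k-nearest neighbors (kNN) algorithm for outlier detection.</p>
<p>First, import the necessary modules from PyOD:</p>
<pre><code>from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print</code></pre>
<p>We generate synthetic data with a predefined outlier rate to simulate outliers.</p>
<pre><code>contamination = 0.1  # fraction of outliers
n_train = 200        # number of training points
n_test = 100         # number of testing points

X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, n_test=n_test, contamination=contamination)</code></pre>
<p>Initialize the kNN detector and fit it to the training data. After fitting, the training labels are available in <code>clf.labels_</code> and predictions for new data come from <code>clf.predict</code>.</p>
<pre><code>clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)</code></pre>
<p>Evaluate the fitted model on the training and test sets using the ROC and Precision @ Rank n metrics.</p>
<pre><code>print("\nOn Training Data:")
evaluate_print(clf_name, y_train, clf.decision_scores_)

print("\nOn Test Data:")
evaluate_print(clf_name, y_test, clf.decision_function(X_test))</code></pre>
<p>Finally, the built-in visualization helper can plot the detection results.</p>
<pre><code># visualize lives in pyod.utils.example in current PyOD releases
from pyod.utils.example import visualize

visualize(clf_name, X_train, y_train, X_test, y_test,
          clf.labels_, clf.predict(X_test),
          show_figure=True, save_figure=False)</code></pre>
<p>That covers a simple usage example.</p>
<h2>Example 2: Model Combination</h2>
<p>Outlier detection can suffer from model instability, especially in the unsupervised setting. PyOD therefore provides model combination techniques to improve robustness.</p>
<pre><code>import numpy as np
from sklearn.model_selection import train_test_split

from pyod.models.knn import KNN
from pyod.models.combination import aom, moa, average, maximization, median
from pyod.utils.utility import standardizer
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print

# Generate a single synthetic dataset, then split it into train and test sets
X, y = generate_data(train_only=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# Standardize the features before fitting the detectors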
X_train_norm, X_test_norm = standardizer(X_train, X_test)

n_clf = 20  # number of base detectors

# Initialize 20 base detectors for combination, one per value of k
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
          110, 120, 130, 140, 150, 160, 170, 180, 190, 200]

# One column of scores per base detector
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

print('Combining {n_clf} kNN detectors'.format(n_clf=n_clf))

for i in range(n_clf):
    k = k_list[i]

    clf = KNN(n_neighbors=k, method='largest')
    clf.fit(X_train_norm)

    train_scores[:, i] = clf.decision_scores_
    test_scores[:, i] = clf.decision_function(X_test_norm)

# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)

# Combination by average
y_by_average = average(test_scores_norm)
evaluate_print('Combination by Average', y_test, y_by_average)

# Combination by max
y_by_maximization = maximization(test_scores_norm)
evaluate_print('Combination by Maximization', y_test, y_by_maximization)

# Combination by median
y_by_median = median(test_scores_norm)
evaluate_print('Combination by Median', y_test, y_by_median)

# Combination by AOM (average of maximum)
y_by_aom = aom(test_scores_norm, n_buckets=5)
evaluate_print('Combination by AOM', y_test, y_by_aom)

# Combination by MOA (maximum of average)
y_by_moa = moa(test_scores_norm, n_buckets=5)
evaluate_print('Combination by MOA', y_test, y_by_moa)</code></pre>
<p>If the code above raises an error, you need to install the combo package:</p>
<pre><code>pip install combo</code></pre>
<h2>Summary</h2>
<p>As the examples show, PyOD makes outlier detection very convenient. From basic kNN outlier detection to model combination, it offers a comprehensive, integrated toolkit that lets us handle outlier detection tasks easily and efficiently.</p>
<p>Finally, the PyOD documentation and website:<br/>https://avoid.overfit.cn/post/9df020be7be84d759aeef2dfa8e4d8cd</p></article>
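To make the AOM and MOA combinations from Example 2 more concrete, here is a minimal, library-free sketch of the idea. It assumes the detectors are split into contiguous, equally sized buckets. PyOD's own `aom` and `moa` may form buckets differently (for example by random selection), so treat this as an illustration of the concept, not as the library's implementation. The helper names and the toy scores are hypothetical.

```python
from statistics import mean


def split_into_buckets(row, n_buckets):
    """Split one sample's detector scores into n_buckets contiguous groups."""
    if not 1 <= n_buckets <= len(row):
        raise ValueError("n_buckets must be between 1 and the number of detectors")
    size, extra = divmod(len(row), n_buckets)
    buckets, start = [], 0
    for b in range(n_buckets):
        # The first `extra` buckets receive one additional detector.
        end = start + size + (1 if b < extra else 0)
        buckets.append(row[start:end])
        start = end
    return buckets


def average_of_maximum(scores, n_buckets):
    """AOM: take the max inside each bucket, then average the maxima."""
    return [mean(max(b) for b in split_into_buckets(row, n_buckets))
            for row in scores]


def maximum_of_average(scores, n_buckets):
    """MOA: average inside each bucket, then take the max of the averages."""
    return [max(mean(b) for b in split_into_buckets(row, n_buckets))
            for row in scores]


# Hypothetical standardized scores: 3 samples (rows) x 4 detectors (columns).
toy_scores = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, 3.0, 0.0, 2.0],
]

aom_scores = average_of_maximum(toy_scores, n_buckets=2)
moa_scores = maximum_of_average(toy_scores, n_buckets=2)
print(aom_scores)  # [2.0, 1.0, 2.5]
print(moa_scores)  # [2.5, 1.0, 1.0]
```

In this toy data the third sample has one very high score (3.0) from a single detector. AOM still ranks it highest because each bucket contributes its maximum, while MOA dampens it because the high score is averaged with a low one inside its bucket.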
...