On data cleaning: advanced machine learning modeling, building an enterprise-grade AI modeling pipeline ⛵

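💡 Author: 韩信子@ShowMeAI
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 Article URL: https://www.showmeai.tech/article-detail/287
📢 Notice: All rights reserved. To repost, please contact the platform and the author and cite the source.
📢 Bookmark ShowMeAI for more great content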

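Introduction to machine learning and pipelines

A machine learning application involves many steps. As the figure "Standard machine learning application workflow" shows, these include data preprocessing, feature engineering, model training, iterative model optimization, and deployment for prediction.

For quick analysis and modeling, each stage can be built and applied on its own. In enterprise applications, however, we would rather organize the stages of a machine learning project into an ordered workflow (a pipeline). This makes the steps easier to understand and reproduce, and it helps prevent problems such as data leakage.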
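To illustrate the data-leakage point, here is a minimal sketch on synthetic data. The dataset and the choice of MinMaxScaler and LogisticRegression are assumptions for illustration, not part of this article's workflow. Because the scaler lives inside the pipeline, cross-validation refits it on the training portion of every fold, so statistics from the held-out portion never reach the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, mildly imbalanced data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)

# The scaler is a pipeline step, so it is fitted inside each training fold
leak_free = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=400)),
])

scores = cross_val_score(leak_free, X, y, cv=5, scoring="roc_auc")
print(len(scores), scores.mean())
```

Scaling the full dataset before calling cross_val_score would instead let every fold see the minimum and maximum of its own validation rows.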

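Common machine learning tools such as Scikit-Learn cover pipelines among their advanced features, including transformers, models, and other modules.

For how to use Scikit-Learn, see the article 📘 The Most Complete SKLearn Application Guide in the ShowMeAI 📘 Machine Learning in Practice tutorial, or go to the Scikit-Learn cheat sheet for a dense list of key points.

However, with plain SKLearn usage, adding an external library to the pipeline, such as imblearn for handling imbalanced samples, can cause incompatibilities, for example the following error: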

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't

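This article uses customer churn as an example to explain how to build an SKLearn pipeline. Specifically, it covers:

Building a pipeline that combines Scikit-Learn, imblearn, and feature-engine
Extracting feature names after encoding steps (for example, one-hot encoding)
Building a feature importance plot

The final solution is shown in the figure below: multiple modules from different packages combined in one pipeline.

The workflow below covers the stages described above:

Step ①: Data preprocessing: data cleaning
Step ②: Feature engineering: handling numeric and categorical features
Step ③: Sample handling: class imbalance handling
Step ④: Logistic regression, xgboost, random forest, and a voting ensemble
Step ⑤: Hyperparameter tuning and feature importance analysis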

💡 Step 0: Prepare and load the data

First, import the required libraries.

# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn tools
from sklearn.model_selection import train_test_split, RandomizedSearchCV, RepeatedStratifiedKFold, cross_validate

# Pipeline-related imports
from sklearn import set_config
from sklearn.pipeline import make_pipeline, Pipeline
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Handling of dropped, constant, and duplicated columns
from feature_engine.selection import DropFeatures, DropConstantFeatures, DropDuplicateFeatures

# Imbalance handling and sampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Models
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance
from scipy.stats import loguniform

# Pipeline visualization
set_config(display="diagram")

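If you have not come across the imblearn and feature-engine packages before, here is a brief introduction:

📘Imblearn handles classification problems with imbalanced classes and ships with a range of sampling strategies

📘feature-engine handles feature columns (constant columns, columns with missing values, duplicated columns, and so on)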

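Dataset: newspaper subscriber churn

The dataset used here is Newspaper churn from Kaggle. It contains 15,856 records of individuals who currently subscribe, or once subscribed, to the newspaper.

🏆 Dataset download (Baidu Netdisk): reply "实战" to the official account "ShowMeAI Research Center", or click here to get the "Newspaper churn dataset" for this article: [[14] Machine learning modeling application pipeline](https://www.showmeai.tech/art…)

⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

The dataset contains demographic information, such as HH income (household income), home ownership, presence of children, ethnicity, years of residence, age range, and language, together with geographic information such as address, state, city, county, and zip code. It also records the subscription length each user chose and the associated charges, as well as the channel through which the user was acquired. A final field indicates whether the customer is still a subscriber (that is, whether they have churned).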

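Data preprocessing and splitting

First, load the data and preprocess it (for example, convert all column names to lowercase and convert the target variable to a boolean).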

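# Load the data
data = pd.read_excel("NewspaperChurn new version.xlsx")

# Preprocessing: normalize column names and build the boolean target
data.columns = [k.lower().replace(" ", "_") for k in data.columns]
data.rename(columns={'subscriber':'churn'}, inplace=True)
# Assign the mapped column back; replace(..., inplace=True) on a column selection
# is not reliably written back to the DataFrame in recent pandas versions
data['churn'] = data['churn'].map({'NO': False, 'YES': True})

# Type conversion: treat text columns as categorical
data[data.select_dtypes(['object']).columns] = data.select_dtypes(['object']).apply(lambda x: x.astype('category'))

# Separate the feature columns from the label column
X = data.drop("churn", axis=1)
y = data["churn"]

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)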

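After preprocessing, the data should look like this:

💡 Step 1: Data cleaning

The first step of the pipeline is data cleaning: removing columns that do not help prediction (such as ID fields, constant-value fields, or duplicated fields).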

# Step 1: data cleaning and column handling
ppl = Pipeline([
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures())
])

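The code above creates a pipeline object with three steps: drop_columns, drop_constant_values, and drop_duplicates.

Each step is a tuple: the first element is the step name (such as drop_columns) and the second is the transformer (such as DropFeatures()).

These simple steps could just as easily be done with an external tool such as pandas. The idea when assembling a pipeline, however, is to integrate as much functionality into it as possible.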
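For comparison, the same three cleaning operations can be sketched in plain pandas. This is an illustrative equivalent on a toy frame, not feature-engine's implementation; the function name clean_columns and the toy columns other than subscriptionid are hypothetical.

```python
import pandas as pd

def clean_columns(df, id_columns):
    """Drop ID columns, constant columns, and duplicated columns with pandas."""
    out = df.drop(columns=id_columns)
    # A constant column has a single distinct value
    keep_varying = (out.nunique(dropna=False) > 1).to_numpy()
    out = out.loc[:, keep_varying]
    # Duplicated columns show up as duplicated rows after transposing
    keep_unique = (~out.T.duplicated()).to_numpy()
    return out.loc[:, keep_unique]

toy = pd.DataFrame({
    "subscriptionid": [101, 102, 103, 104],
    "country": [1, 1, 1, 1],                    # constant column
    "reward_program": [0, 2, 1, 0],
    "reward_copy": [0, 2, 1, 0],                # duplicate of reward_program
    "zip_code": [90001, 90002, 90001, 90003],
})
cleaned = clean_columns(toy, ["subscriptionid"])
print(list(cleaned.columns))
```

The pandas version recomputes its decisions on whatever frame it receives. The pipeline steps learn which columns to drop during fit and then apply the same decision to validation and production data, which is the main reason to keep them inside the pipeline.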

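💡 Step 2: Feature engineering and data transformation

With the irrelevant columns removed, the next step is missing-value handling and feature engineering. The dataset contains columns of different types (numeric and categorical), so we define a separate workflow for each type.

For more on feature engineering, see the article 📘 The Most Complete Guide to Machine Learning Feature Engineering in the ShowMeAI 📘 Machine Learning in Practice tutorial.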

# Data processing and feature engineering pipeline

ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    
    # ② Missing-value imputation and numeric/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: impute missing values in numeric columns, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: impute missing values in categorical columns, then one-hot encode
        # (sparse_output replaces the older sparse argument from scikit-learn 1.2 on)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    )
])

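This adds a step named cleaning, which holds a ColumnTransformer object.

Inside the ColumnTransformer there are two new pipelines: one for numeric columns and one for categorical columns. The make_column_selector function ensures that each one receives columns of the right type.

Here the dtype_include parameter selects columns by type. make_column_selector can also select columns by a regular expression on the column name (the pattern parameter), and ColumnTransformer itself accepts an explicit list of column names in place of a selector.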
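The selector options can be tried on a small frame. This is a minimal sketch: the toy values are made up, and only the column names echo the dataset.

```python
import pandas as pd
from sklearn.compose import make_column_selector

toy = pd.DataFrame({
    "year_of_residence": pd.Series([3, 10, 7], dtype="int64"),
    "zip_code": pd.Series([90001, 90002, 90003], dtype="int64"),
    "hh_income": pd.Series(["low", "high", "low"], dtype="category"),
    "source_channel": pd.Series(["TMC", "VRU", "TMC"], dtype="category"),
})

# A selector is a callable: given a DataFrame, it returns the matching column names
by_int = make_column_selector(dtype_include="int64")(toy)
by_category = make_column_selector(dtype_include="category")(toy)
by_pattern = make_column_selector(pattern="^hh_")(toy)
print(by_int, by_category, by_pattern)
```

Note that dtype_include='int64' skips float columns. If a dataset stores numeric features as floats, a broader selector would be needed for them to reach the numeric branch.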

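💡 Step 3: Class imbalance handling (data sampling)

In problems such as customer churn and fraud detection, a major challenge is class imbalance: churned users are far fewer than non-churned users.

Here we use the imblearn library to handle class imbalance. It provides a range of data generation and sampling methods to mitigate the problem. This time we use SMOTE to oversample the minority class.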
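The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below shows only that interpolation step under simplifying assumptions (Euclidean distance, a fixed k, a hypothetical function name); it is not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create synthetic points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbours of each point

    base = rng.integers(0, len(X_minority), size=n_new)         # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    lam = rng.random((n_new, 1))                                # position along the segment
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE adds rows but never changes the set of feature columns.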

SMOTE class imbalance handling

With the SMOTE step added, the pipeline looks like this:

# Overall processing pipeline

ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    
    # ② Missing-value imputation and numeric/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: impute missing values in numeric columns, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: impute missing values in categorical columns, then one-hot encode
        # (sparse_output replaces the older sparse argument from scikit-learn 1.2 on)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class imbalance handling: resampling
    ('smote', SMOTE())
])

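Checking the pipeline's features

Before building the final ensemble classifier, let's inspect the feature names and other information produced by the pipeline.

The pipeline object provides a method named get_feature_names_out() that returns the feature names. Before using it, the pipeline must be fitted on the dataset. Step ③ (SMOTE) only resamples rows and does not change the feature columns, so we leave it out for now and focus on steps ① and ②.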

# Fit the first four steps and get the feature names produced by the pipeline
ppl_fts = ppl[0:4]
ppl_fts.fit(X_train, y_train)
features = ppl_fts.get_feature_names_out()
pd.Series(features)

The result looks like this:

0                    num__year_of_residence
1                             num__zip_code
2                       num__reward_program
3        cat__hh_income_$  20,000 - $29,999
4        cat__hh_income_$  30,000 - $39,999
                        ...                
12122               cat__source_channel_TMC
12123            cat__source_channel_TeleIn
12124           cat__source_channel_TeleOut
12125               cat__source_channel_VRU
12126          cat__source_channel_iSrvices
Length: 12127, dtype: object

Because of one-hot encoding, many feature names with the cat__ prefix (for category) have been created.

To get a pipeline visualization like the flow diagram above, only a small change is needed: add set_config(display="diagram") to your code before displaying the pipeline object.

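💡 Step 4: Build the ensemble classifier

Next, we train several models and combine them in an ensemble model (a voting classifier) to tackle the problem.

For the logistic regression, random forest, and xgboost models used here, see the ShowMeAI 📘 Illustrated Machine Learning Algorithms tutorial for detailed explanations of how they work.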

# Logistic regression model
lr = LogisticRegression(warm_start=True, max_iter=400)
# Random forest model
rf = RandomForestClassifier()
# xgboost (verbosity=0 already silences messages; the deprecated silent flag is dropped)
xgb = XGBClassifier(tree_method="hist", verbosity=0)
# Combine the models with a soft-voting ensemble
lr_xgb_rf = VotingClassifier(estimators=[('lr', lr), ('xgb', xgb), ('rf', rf)], 
                             voting='soft')
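With voting='soft', the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below checks this on synthetic data. It uses only two scikit-learn models to stay self-contained; leaving out xgboost is a simplification for the example, not a change to the article's ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=400)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X, y)

# Average the class probabilities of the fitted member models by hand
manual = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)
matches = np.allclose(manual, voter.predict_proba(X))
print(matches)
```

Soft voting requires every member to implement predict_proba, which all three models in this article do.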

With the ensemble defined, we add it to the pipeline as well.

# Overall processing pipeline

ppl = imbPipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    
    # ② Missing-value imputation and numeric/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: impute missing values in numeric columns, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: impute missing values in categorical columns, then one-hot encode
        # (sparse_output replaces the older sparse argument from scikit-learn 1.2 on)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class imbalance handling: resampling
    ('smote', SMOTE()),
    # ④ Voting ensemble
    ('ensemble', lr_xgb_rf)
])

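You may notice that the Pipeline used on the first line has been replaced with imblearn's imbPipeline. This is a key change: with SKLearn's Pipeline, fitting raises the error mentioned at the start of the article: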

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
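The error arises because SKLearn's Pipeline requires every intermediate step to provide transform, while a sampler such as SMOTE exposes fit_resample instead, since it changes the number of rows and must only run during fitting. The sketch below reproduces the check with a hypothetical stand-in class (ToySampler is invented for this example and is not part of imblearn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ToySampler:
    """Hypothetical stand-in for a sampler: it has fit_resample but no transform."""
    def fit_resample(self, X, y):
        return X, y

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

try:
    # Depending on the scikit-learn version, the check runs at construction or at fit time
    Pipeline([("sampler", ToySampler()), ("model", LogisticRegression())]).fit(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

imblearn's Pipeline accepts such steps and calls fit_resample during fit only, so prediction on new data is not resampled.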

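At this point, the basic pipeline is in place.

💡 Step 5: Hyperparameter tuning and feature importance

Hyperparameter tuning

Many components of the modeling pipeline have tunable hyperparameters that affect the final model's performance. To tune hyperparameters across the pipeline, we use random search with RandomizedSearchCV, as in the code below.

For the principles behind search-based tuning, see the ShowMeAI article 📘 Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization, and Programming Frameworks.

Pay particular attention to the naming convention in the code.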

# Hyperparameter search space
params = {
    # liblinear and saga both support the l1 and l2 penalties, so every sampled pair is valid
    'ensemble__lr__solver': ['liblinear', 'saga'],
    'ensemble__lr__penalty': ['l1', 'l2'],
    'ensemble__lr__C': loguniform(1e-5, 100),
    'ensemble__xgb__learning_rate': [0.1],
    'ensemble__xgb__max_depth': [7, 10, 15, 20],
    'ensemble__xgb__min_child_weight': [10, 15, 20, 25],
    'ensemble__xgb__colsample_bytree': [0.8, 0.9, 1],
    'ensemble__xgb__n_estimators': [300, 400, 500, 600],
    'ensemble__xgb__reg_alpha': [0.5, 0.2, 1],
    'ensemble__xgb__reg_lambda': [2, 3, 5],
    'ensemble__xgb__gamma': [1, 2, 3],
    'ensemble__rf__max_depth': [7, 10, 15, 20],
    'ensemble__rf__min_samples_leaf': [1, 2, 4],
    'ensemble__rf__min_samples_split': [2, 5, 10],
    'ensemble__rf__n_estimators': [300, 400, 500, 600],
}

# Randomized search
rsf = RepeatedStratifiedKFold(random_state=42)
clf = RandomizedSearchCV(ppl, params,scoring='roc_auc', verbose=2, cv=rsf)
clf.fit(X_train, y_train)

# Report the results
print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)
# AUC is computed from the predicted probability of the positive class, not from hard labels
print("AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
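A note on the validation AUC: ROC AUC measures how well the model ranks positives above negatives, so it is normally computed from scores or probabilities. Passing hard class labels collapses the ranking to two levels and can understate the result. The tiny example below uses made-up probabilities to show the difference.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.2, 0.3, 0.4, 0.9]                   # hypothetical predicted probabilities
hard_labels = [int(s >= 0.5) for s in scores]   # what predict() returns at a 0.5 cut-off

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank both positives above both negatives, giving an AUC of 1.0, while the thresholded labels give 0.75 because one positive falls below the cut-off.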

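To explain the hyperparameter names in the code above:

First part ( ensemble__ ): the name of our VotingClassifier step
Second part ( lr__ ): the name of a model used in the ensemble
Third part ( solver ): the name of that model's hyperparameter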
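The naming rule above can be checked directly with get_params and set_params. The sketch below uses a reduced pipeline (a scaler plus a two-model ensemble) as an assumption to keep it short; the double-underscore path works the same way in the full pipeline.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("rf", RandomForestClassifier())],
    voting="soft",
)
toy_ppl = Pipeline([("scale", MinMaxScaler()), ("ensemble", voter)])

# Every tunable parameter is addressable as <step>__<model>__<parameter>
all_params = toy_ppl.get_params()
lr_params = sorted(name for name in all_params if name.startswith("ensemble__lr__"))
print(lr_params[:3])

# set_params follows the same path down to the nested estimator
toy_ppl.set_params(ensemble__lr__C=0.5, ensemble__rf__max_depth=7)
members = dict(voter.estimators)
print(members["lr"].C, members["rf"].max_depth)
```

Printing the keys of get_params() is a quick way to find the exact name to use in a search space when a pipeline has several nesting levels.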

Because this is a class-imbalanced scenario, we use repeated stratified k-fold cross-validation (RepeatedStratifiedKFold).
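Stratification keeps the class ratio of the full dataset in every fold, and repetition reruns the split with different shuffles to reduce variance. A small sketch with an artificial 10% positive class (the data is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# 100 samples with a 10% positive class (illustrative)
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

rsf = RepeatedStratifiedKFold(random_state=42)   # defaults: 5 splits, 10 repeats
n_fits = rsf.get_n_splits(X, y)

# Share of positives in every validation fold
positive_rates = [y[test_idx].mean() for _, test_idx in rsf.split(X, y)]
print(n_fits, min(positive_rates), max(positive_rates))
```

With the defaults, each sampled hyperparameter candidate is fitted 50 times, so the search cost grows quickly. Passing smaller n_splits or n_repeats values is a common way to shorten experiments.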

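Hyperparameter tuning is not mandatory. In simple scenarios you can use the default parameters directly, or fix the hyperparameters when defining the models.

Feature importance plot

To keep the model from being a black box, we want to explain it. The most important part is attribution analysis: understanding which features matter. Here we plot the feature importances.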

# https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html
# Plot feature importances
def plot_feature_importances(perm_importance_result, feat_name):
    """Bar plot of the permutation feature importances."""
    fig, ax = plt.subplots()

    indices = perm_importance_result['importances_mean'].argsort()
    ax.barh(range(len(indices)),
            perm_importance_result['importances_mean'][indices],
            xerr=perm_importance_result['importances_std'][indices])
    ax.set_yticks(range(len(indices)))
    ax.set_title("Permutation importance")

    tmp = np.array(feat_name)
    _ = ax.set_yticklabels(tmp[indices])


# Feature names: permutation_importance shuffles the columns of the data it is given.
# clf wraps the whole pipeline and receives the raw X_train, so the result has one entry
# per raw input column. The labels are therefore the raw column names, not the one-hot
# encoded names returned by get_feature_names_out()
features = X_train.columns


# Compute permutation importances, then sort and plot them
perm_importance_result_train = permutation_importance(clf, X_train, y_train, random_state=42)
plot_feature_importances(perm_importance_result_train, features)

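Running the code above produces the plot below, in which the feature hh_income dominates the prediction. Because this feature is actually ordinal (for example, 30-40k is less than 150-175k), a different encoding could be used for it, such as ordinal encoding (often called label encoding).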
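As a sketch of that alternative, scikit-learn's OrdinalEncoder can map income bands to integers in an explicitly stated order. The band labels below are simplified, hypothetical versions of the dataset's categories, and the article's pipeline does not include this step.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Income bands listed in ascending order (labels are hypothetical simplifications)
income_order = ["$20,000 - $29,999", "$30,000 - $39,999", "$150,000 - $174,999"]

toy = pd.DataFrame({"hh_income": [
    "$30,000 - $39,999",
    "$150,000 - $174,999",
    "$20,000 - $29,999",
]})

# Passing the categories explicitly fixes the order instead of relying on alphabetical sorting
encoder = OrdinalEncoder(categories=[income_order])
encoded = encoder.fit_transform(toy[["hh_income"]]).ravel().tolist()
print(encoded)
```

Stating the order explicitly matters here: sorted as strings, "$150,000 - $174,999" would come before "$20,000 - $29,999", which would scramble the ranking.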

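That is the complete process of building a machine learning pipeline. As you can see, a pipeline brings the different stages together so they can be run and tuned in one go, which makes both the code and the workflow more concise, compact, and efficient.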

References

- 📘 Machine Learning in Practice tutorial: https://www.showmeai.tech/tutorials/41
- 📘 The Most Complete SKLearn Application Guide: https://www.showmeai.tech/article-detail/203
- 📘 Imblearn, for classification with imbalanced classes: https://imbalanced-learn.org/stable/
- 📘 feature-engine, for handling feature columns (constant, missing, and duplicated columns): https://feature-engine.readthedocs.io/en/latest/
- 📘 The Most Complete Guide to Machine Learning Feature Engineering: https://www.showmeai.tech/article-detail/208
- 📘 Illustrated Machine Learning Algorithms tutorial: http://showmeai.tech/tutorials/34
- 📘 Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization, and Programming Frameworks: https://www.showmeai.tech/article-detail/218
- 📘 Scikit-Learn cheat sheet: https://www.showmeai.tech/article-detail/108
