2021 iFLYTEK Vehicle Loan Default Prediction Challenge Top 1: Solution Study Notes

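Introduction

The vehicle loan default prediction problem aims to build a risk identification model that predicts which borrowers are likely to default. The prediction output is whether a borrower is likely to default, so this is a binary classification problem.

This is a data-mining-heavy competition. The key is how to derive useful features from an understanding of the data.

Here I take the winner's point of view and try to study and summarize the solution. Standing on the shoulders of giants, perhaps we can see a little further.

Let's get straight to the point and start learning the tricks~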

Feature Engineering

1. Common libraries and data loading

import os
import pickle
import logging

import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, auc, roc_curve, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, KBinsDiscretizer, LabelEncoder, MinMaxScaler, PowerTransformer
from tqdm import tqdm

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

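The latter part of the imports brings in a few utilities:

tqdm: an elegant progress bar that makes it easy to watch the progress and speed of a run;
pickle: stores objects on disk as files. Almost any data type can be serialized with pickle. You typically dump first and load later, much like writing out and reading back in. The benefit is computing once and reusing many times, which avoids repeated work. For example, if processing column A takes 2 h and every later change requires re-running the other columns but not column A, pickle lets you quickly reload the earlier result;
logging: prints log messages to the console so the running state is easy to check;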
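The code later in this post calls `save_pkl` and `load_pkl`, but their definitions are not shown. Below is a minimal sketch of what such helpers might look like, assuming they are thin wrappers around `pickle.dump` and `pickle.load`. The function bodies, the temporary path, and the toy dictionary are my own assumptions, not the author's code.

```python
import os
import pickle
import tempfile


def save_pkl(obj, path):
    """Serialize obj to path with pickle (hypothetical helper)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pkl(path):
    """Load a pickled object back from path (hypothetical helper)."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip: dump once, then reload instead of recomputing.
cache_path = os.path.join(tempfile.mkdtemp(), 'expensive_result.pkl')
expensive_result = {'col_A': [1, 2, 3]}  # stand-in for a slow computation
save_pkl(expensive_result, cache_path)
restored = load_pkl(cache_path)
```

The same "compute only if the file does not exist" pattern is what the neighbor feature function uses further down.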

logging.info('data loading...')
train = pd.read_csv('../xfdata/车辆贷款守约预测数据集/train.csv')
test = pd.read_csv('../xfdata/车辆贷款守约预测数据集/test.csv')

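2. Feature engineering

2.1 Constructing features

For both the training set and the test set:

Compute new features based on an understanding of the business;
Apply equal-width binning (cut) to some ratio features, equal-frequency binning (qcut) to some numeric features, and custom binning with hand-picked bin edges to other numeric features;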
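To see how the three binning styles differ, here is a small sketch on a toy series with one outlier. The values and the two-bin setup are made up purely for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range [1, 100] is split into 2 intervals of equal length,
# so the outlier ends up alone in the upper bin.
equal_width = pd.cut(values, 2, labels=[0, 1])

# Equal-frequency: the bin edges are quantiles, so each bin gets half the rows.
equal_freq = pd.qcut(values, 2, labels=[0, 1])

# Custom edges: the intervals are (-1, 5] and (5, 100].
custom = pd.cut(values, [-1, 5, 100], labels=[0, 1])

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
custom_counts = custom.value_counts().sort_index().tolist()
print(width_counts, freq_counts, custom_counts)
```

Equal-width bins are sensitive to outliers, which is probably why the long-tailed amount columns get hand-picked edges instead.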

def gen_new_feats(train, test):
    '''Generate new features, e.g. interest rates and binned features'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])

    # Step 2: the actual feature engineering
    # Interest rate of the sub account (called the annual rate in the original;
    # the formula is total interest over the whole tenure divided by the principal)
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure']
                        - data['sub_account_sanction_loan']) / data['sub_account_sanction_loan']
    # Interest rate of the main account, same formula
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure']
                         - data['main_account_sanction_loan']) / data['main_account_sanction_loan']

    # Bin some of the features
    # Equal-width binning
    loan_to_asset_ratio_labels = [i for i in range(10)]
    data['loan_to_asset_ratio_bin'] = pd.cut(data["loan_to_asset_ratio"], 10, labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=loan_to_asset_ratio_labels)
    # Custom binning
    amount_cols = [
        'total_monthly_payment',
        'main_account_sanction_loan',
        'main_account_disbursed_loan',
        'sub_account_sanction_loan',
        'sub_account_disbursed_loan',
        'main_account_monthly_payment',
        'sub_account_monthly_payment',
        'total_sanction_loan'
    ]
    amount_labels = [i for i in range(10)]
    for col in amount_cols:
        # The last edge is the column maximum; pd.cut needs increasing edges,
        # so this assumes the maximum is above 3,000,000
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000,
                                     data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin, labels=amount_labels).astype(int)

    # Step 3: return the training and test sets with the new features
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]

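2.2 Encoding: Target Encoding

Target encoding is a way of encoding a feature using the target values.

In binary classification, for feature i, the target encoding of the value k is the expected target value of category k: E(y | x_i = x_ik).

For example, suppose a sample set has 10 records in total and 3 of them have the feature Trend equal to Up. We look only at those 3 records. If 2 of the 3 have a positive target, the expected target for k = Up is 2/3 ≈ 0.67, so Up is encoded as 0.67.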
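The Trend example can be reproduced in a few lines of pandas. This is a minimal sketch: only the three Up rows follow the text, while the Down and Flat rows and the column name `target` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Trend':  ['Up', 'Up', 'Up', 'Down', 'Down', 'Down', 'Down', 'Flat', 'Flat', 'Flat'],
    'target': [1, 1, 0, 0, 0, 1, 0, 1, 0, 0],
})

# Expected target per category, E(y | Trend = k)
encoding = df.groupby('Trend')['target'].mean()

# Replace each category by its expected target
df['Trend_mean_target'] = df['Trend'].map(encoding)

up_code = encoding['Up']
print(encoding)
```

Note that this naive version encodes each row with a mean that includes the row's own label, which is the leakage that the k-fold version in the author's code is designed to avoid.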

The author applies target encoding mainly to the id features.

def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target encoding features'''
    # Training set: encode each fold with means computed on the other folds
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for _, (train_index, val_index) in enumerate(kfold.split(train[encode_cols], train[target_col])):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            # Write straight into the array; assigning a new column to the
            # df_val slice would trigger pandas' SettingWithCopyWarning
            tg_feats[val_index, idx] = df_val[col].map(target_mean_dict).values

    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]

    # Test set: use means computed on the whole training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)

    return train, test

To be honest, I have not fully understood this code yet~ I am noting it down for now and will simply pull it out when I need it, haha
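My current reading is that the k-fold loop exists so that a row is never encoded with its own label. Here is a small sketch of that idea under my own assumptions: the toy `branch` column, the labels, and the 3-fold split are invented, and the category `c` appears exactly once so the difference is easy to see.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

toy = pd.DataFrame({
    'branch': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c'],
    'y':      [1, 0, 1, 0, 1, 1, 0, 1, 1],
})

# Naive encoding: the single 'c' row is encoded with its own label (leakage).
naive = toy['branch'].map(toy.groupby('branch')['y'].mean())

# Out-of-fold encoding: each row only sees means computed on the other folds.
oof = np.full(len(toy), np.nan)
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, val_idx in kfold.split(toy, toy['y']):
    fold_means = toy.iloc[fit_idx].groupby('branch')['y'].mean()
    oof[val_idx] = toy.iloc[val_idx]['branch'].map(fold_means).to_numpy()

print(naive.iloc[8], oof[8])
```

The naive code for `c` simply repeats its label, while the out-of-fold code for `c` is NaN because no other fold contains that category. Tree models such as lightgbm and xgboost can handle such missing values natively.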

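2.3 Neighbor fraud feature

In risk control, risky accounts may have been registered in bulk in the same batch, so their ids may be consecutive.

Here the author builds a neighbor fraud feature: the mean label of roughly the 10 accounts before and after each account. This mean acts as a probability, namely the probability that defaulting accounts cluster around this id, and to some extent it reflects how likely the account itself is to default.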
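The author's loop is slow because it rebuilds lookups for every id. As a sketch of the same idea, the neighbor mean can be computed with rolling windows. This is not the author's implementation: the function name `neighbor_label_mean`, the symmetric window of k ids on each side, and the toy ids are my own assumptions.

```python
import numpy as np
import pandas as pd


def neighbor_label_mean(ids, labels, all_ids, k=10):
    """Mean label of the labeled ids within +/-k of each id, excluding the id itself."""
    label_by_id = pd.Series(labels, index=ids, dtype=float)
    lo, hi = min(all_ids), max(all_ids)
    # One slot per id in the range; NaN where no label is known (e.g. test ids)
    full = label_by_id.reindex(range(lo, hi + 1))
    known = full.notna().astype(float)
    window = 2 * k + 1
    sums = full.fillna(0).rolling(window, center=True, min_periods=1).sum()
    counts = known.rolling(window, center=True, min_periods=1).sum()
    # Remove each id's own contribution so only its neighbors remain
    sums = sums - full.fillna(0)
    counts = counts - known
    return (sums / counts).reindex(all_ids)


# ids 0, 1, 2, 3, 5 are labeled; id 4 plays the role of a test account
demo = neighbor_label_mean(ids=[0, 1, 2, 3, 5], labels=[1, 0, 1, 1, 0],
                           all_ids=list(range(6)), k=1)
print(demo)
```

An id with no labeled neighbor in its window gets NaN, and unlabeled ids still receive a value from the labeled accounts around them.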

def gen_neighbor_feats(train, test):
    '''Generate the neighbor fraud feature'''
    # save_pkl / load_pkl are pickle helper functions that are not shown in this post
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a long time to compute, so it is cached as a pkl file
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            # Note: as written, the window covers 10 ids before and 9 ids after i
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))

            # Keep only the neighbors that exist in the training set
            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors if
                                     customer_id_neighbor in train.customer_id.values.tolist()]
            neighbor_default_prob = train.set_index('customer_id').loc[customer_id_neighbors].loan_default.mean()
            neighbor_default_probs.append(neighbor_default_prob)

        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')

    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')
    return train, test

2.4 Feature engineering output

TARGET_ENCODING_FETAS = [
    'employment_type',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'asset_cost_bin'
]

# Feature engineering
logging.info('feature generating...')
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS, target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)

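Next comes post-processing of the features, such as converting the data types of some transformed features and simplifying some rate features. This makes the features easier for the models to learn from and improves robustness.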
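The "simplification" of the rate features in the code below caps every value above 1 at 1 with an `apply` and a lambda. As a side note, here is a sketch showing that `Series.clip(upper=1)` gives the same result on a toy series. The sample values are made up, and swapping in `clip` is my suggestion rather than what the author did.

```python
import numpy as np
import pandas as pd

rate = pd.Series([0.2, 1.7, np.nan, 1.0, -0.3])

# The author's style: element-wise lambda
capped_apply = rate.apply(lambda x: 1 if x > 1 else x)

# Vectorized alternative: cap at 1, leave smaller values and NaN untouched
capped_clip = rate.clip(upper=1)

same = np.allclose(capped_apply, capped_clip, equal_nan=True)
print(capped_clip.tolist(), same)
```

On a dataset with a couple of hundred thousand rows the vectorized form is usually noticeably faster, though for three columns either is fine.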

# Names of the final features to keep
SAVE_FEATS = [
    'customer_id',
    'neighbor_default_prob',
    'disbursed_amount',
    'asset_cost',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'credit_score',
    'loan_to_asset_ratio',
    'year_of_birth',
    'age',
    'sub_Rate',
    'main_Rate',
    'loan_to_asset_ratio_bin',
    'asset_cost_bin',
    'employment_type_mean_target',
    'branch_id_mean_target',
    'supplier_id_mean_target',
    'manufacturer_id_mean_target',
    'area_id_mean_target',
    'employee_code_id_mean_target',
    'asset_cost_bin_mean_target',
    'credit_history',
    'average_age',
    'total_disbursed_loan',
    'main_account_disbursed_loan',
    'total_sanction_loan',
    'main_account_sanction_loan',
    'active_to_inactive_act_ratio',
    'total_outstanding_loan',
    'main_account_outstanding_loan',
    'Credit_level',
    'outstanding_disburse_ratio',
    'total_account_loan_no',
    'main_account_tenure',
    'main_account_loan_no',
    'main_account_monthly_payment',
    'total_monthly_payment',
    'main_account_active_loan_no',
    'main_account_inactive_loan_no',
    'sub_account_inactive_loan_no',
    'enquirie_no',
    'main_account_overdue_no',
    'total_overdue_no',
    'last_six_month_defaulted_no'
]

# Feature engineering post-processing
# Simplify features: cap the rate features at 1
for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
    train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
    test[col] = test[col].apply(lambda x: 1 if x > 1 else x)

# Data type conversion
train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)

# Save the datasets with the new features
logging.info('new data saving...')
cols = SAVE_FEATS + ['loan_default', ]
train[cols].to_csv('./train_final.csv', index=False)
test[cols].to_csv('./test_final.csv', index=False)

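Model Building

1. Model training with cross-validation

Two gradient boosted tree models are used, lightgbm and xgboost. I won't explain much here, since the code below has basically become "standard". You know the drill~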

def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # Use positional indexing for the labels as well
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)

        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': -1,
            'seed': 1024
        }

        # Note: verbose_eval and early_stopping_rounds were removed from lgb.train
        # in LightGBM 4.0; newer versions use the callbacks argument instead
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        verbose_eval=50,
                        early_stopping_rounds=20)

        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds


def train_xgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train xgboost with k-fold split'''
    gbms = []
    # The original hard-coded n_splits=10 here; the caller passes n_fold=10,
    # so using the argument gives the same split
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)

        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample': 0.9,
            'min_child_weight': 10,
            'colsample_bytree': 0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }

        watchlist = [(dtrain, 'train'), (dvalid, 'test')]

        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)

        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=(0, gbm.best_iteration))
        test_preds += gbm.predict(dtest, iteration_range=(0, gbm.best_iteration)) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds

def train_xgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train xgboost'''
    # Cap the rate features at 1
    for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
        train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
        test[col] = test[col].apply(lambda x: 1 if x > 1 else x)

    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    # save_pkl is a pickle helper function that is not shown in this post
    if not os.path.exists('../user_data/gbms_xgb.pkl'):
        save_pkl(gbms_xgb, '../user_data/gbms_xgb.pkl')

    return gbms_xgb, oof_preds_xgb, test_preds_xgb


def train_lgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train lightgbm'''
    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb_kfold(X_train, y_train, X_test, n_fold=n_fold)

    if not os.path.exists('../user_data/gbms_lgb.pkl'):
        save_pkl(gbms_lgb, '../user_data/gbms_lgb.pkl')

    return gbms_lgb, oof_preds_lgb, test_preds_lgb

Producing the model training results:

# Load the raw datasets
logging.info('data loading...')
train = pd.read_csv('../xfdata/车辆贷款守约预测数据集/train.csv')
test = pd.read_csv('../xfdata/车辆贷款守约预测数据集/test.csv')

# Feature engineering
logging.info('feature generating...')
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS, target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)

train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)
train['asset_cost_bin_mean_target'] = train['asset_cost_bin_mean_target'].astype(float)
test['asset_cost_bin_mean_target'] = test['asset_cost_bin_mean_target'].astype(float)

# Model training: xgboost results differ slightly between Linux and macOS;
# the results from the saved model files are authoritative
gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb(train.copy(), test.copy(),
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')
gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb(train, test,
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')

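2. Choosing the threshold

Because this is 0/1 binary classification, the mean of the label can be read approximately as the probability that loan_default=1.
The cv (out-of-fold) predictions are then sorted, and the predicted value at the quantile 1 - P(loan_default=1) is taken as the cut-off between predicted positives and negatives.

To make the result more precise, the points near this cut-off are scanned with a small step to find the locally optimal probability threshold.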
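Here is a self-contained sketch of that two-step search on synthetic data, so it can be run without the competition files. The synthetic labels and scores are my own assumptions; the ±0.2 range, the 0.01 step, and the macro F1 metric follow the author's function below.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.2).astype(int)  # about 20% positives
# Synthetic scores: positives are shifted upwards by 0.25 on average
scores = np.clip(0.25 * y_true + rng.normal(0.3, 0.15, size=2000), 0, 1)

# Step 1: starting point, the score at quantile 1 - P(y = 1)
pos_rate = y_true.mean()
start = np.quantile(scores, 1 - pos_rate)

# Step 2: scan +/-0.2 around the starting point in steps of 0.01, keep the best macro F1
candidates = np.arange(start - 0.2, start + 0.2, 0.01)
f1s = [f1_score(y_true, (scores > t).astype(int), average='macro') for t in candidates]
best_thres = candidates[int(np.argmax(f1s))]
best_f1 = max(f1s)
print(f'start={start:.3f} best_thres={best_thres:.3f} best_f1={best_f1:.3f}')
```

Starting from the quantile means the scan begins where the predicted positive rate matches the observed one, so a narrow window is enough.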

def gen_thres_new(df_train, oof_preds):
    df_train['oof_preds'] = oof_preds
    # Can be read as the probability that loan_default=1 in the training set
    quantile_point = df_train['loan_default'].mean()
    # e.g. for labels 0,1,1,1: mean = 0.75, 1 - mean = 0.25, so the 25% quantile is used
    thres = df_train['oof_preds'].quantile(1 - quantile_point)

    _thresh = []
    # Scan +/-0.2 around the theoretical threshold in steps of 0.01;
    # the threshold with the highest F1 score is the best one
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])

    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()  # row with the highest F1
    best_thresh = _thresh[best_id][0]  # the best threshold

    print("threshold: {}\ntraining-set f1: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh

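3. Model blending

The percentile ranks of the xgb and lgb cv predictions are combined in a weighted sum, and then the 0/1 probability threshold is searched again for the blended model.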
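A tiny sketch of why ranks are blended instead of raw probabilities: the two models can score on very different scales, but their percentile ranks are directly comparable. The four toy predictions are invented; the 0.31 / 0.69 weights are the ones used in the author's code.

```python
import pandas as pd

preds = pd.DataFrame({
    'xgb': [0.10, 0.40, 0.35, 0.80],  # spread over a wide range
    'lgb': [0.02, 0.05, 0.03, 0.09],  # squeezed into a narrow range
})

# Percentile rank: position in the sorted order divided by the number of rows
preds['xgb_rank'] = preds['xgb'].rank(pct=True)
preds['lgb_rank'] = preds['lgb'].rank(pct=True)

# Weighted sum of the ranks
preds['blend'] = 0.31 * preds['xgb_rank'] + 0.69 * preds['lgb_rank']
print(preds)
```

Both models order the four rows identically here, so the blended value equals the shared percentile rank even though the raw scores are on very different scales.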

xgb_thres = gen_thres_new(train, oof_preds_xgb)
lgb_thres = gen_thres_new(train, oof_preds_lgb)

# Collect the results
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'loan_default': train['loan_default'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb})

# Model blending
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)  # percentile rank after sorting
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']

# Probability threshold of the blended model
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])

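Prediction

Using the threshold found on the blended training-set predictions, the test-set predictions are split into 0/1 and written out as the final submission.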

def gen_submit_file(df_test, test_preds, thres, save_path):
    # Split by the threshold of the final blended model
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit


df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})

df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']

# Write the result
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')

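Summary

The author's code style is clear and concise, the code reads smoothly, and the line of thinking is easy to follow. This kind of engineering-style code is well worth studying: it is easy to extend and easy to debug.

From the competition point of view, thinking about the business led to a "neighbor fraud feature" built on how ids cluster. For blending, the models are weighted by the percentile ranks of their predictions. Both tricks can be reused directly~ (they are also the score-boosting points the author mentioned)

The 2 questions below probably puzzle many readers just as they puzzled me, so I took screenshots straight from Zhihu:

Source code: https://github.com/WangliLin/...

I have also put together an ipynb for easier study. If you need it, reply "1208" to the official account to get it.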


You are welcome to follow my personal official account: Distinct数说