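2021 iFLYTEK Vehicle Loan Default Prediction Challenge, Top 1: Studying the Solution
Introduction
The task is vehicle loan default prediction: build a risk identification model that predicts which borrowers are likely to default. The output is whether a borrower will default, so this is a binary classification problem.
It is a data-mining-oriented competition, and the key is how to abstract useful features from an understanding of the data.
Taking the winner's point of view, I try to learn from the solution and summarize it. Standing on the shoulders of giants, maybe we can see a little further.
Straight to the topic, let's start learning the tricks, wuhu~
Feature Engineering
1. Common libraries and data loading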
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, auc, roc_curve, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, KBinsDiscretizer, LabelEncoder, MinMaxScaler, PowerTransformer
from tqdm import tqdm
import pickle
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
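The second half of the imports brings in a few utilities:
- tqdm: an elegant progress bar, handy for watching how far along a job is and how fast it runs;
- pickle: stores objects on disk as files; almost any data type can be serialized with pickle. You typically dump first and load later, much like writing out and reading back in. The point is to compute once and reuse many times. For example, if processing column A takes 2 hours and later changes only require rerunning the other columns, pickle lets you load the earlier result for A instead of recomputing it;
- logging: prints logs to the console, handy for checking the run status;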
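The code later in this post calls `save_pkl` and `load_pkl`, whose definitions are not shown. Below is a minimal sketch of what such helpers might look like, assuming they are thin wrappers around `pickle.dump` and `pickle.load`; the signatures are inferred from the call sites, and the cached object and temporary path are made up for illustration.

```python
import os
import pickle
import tempfile


def save_pkl(obj, path):
    """Serialize obj to path (assumed signature, inferred from the call sites)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pkl(path):
    """Load a previously pickled object from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Compute once, then reuse the cached result on later runs.
cache_path = os.path.join(tempfile.mkdtemp(), 'slow_feature.pkl')
if not os.path.exists(cache_path):
    slow_feature = {'col_a': [1, 2, 3]}  # stand-in for an expensive computation
    save_pkl(slow_feature, cache_path)
restored = load_pkl(cache_path)
```

The `os.path.exists` check is the same "skip the work if the file is already there" pattern used for the neighbor feature and the saved models.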
logging.info('data loading...')
train = pd.read_csv('../xfdata/车辆贷款违约预测数据集/train.csv')
test = pd.read_csv('../xfdata/车辆贷款违约预测数据集/test.csv')
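2. Feature engineering
2.1 Constructing features
For both the training and test sets:
- compute new features based on business understanding;
- apply equal-width binning (cut) to some ratio features, equal-frequency binning (qcut) to some numeric features, and custom binning with hand-picked bin ranges to some other numeric features;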
def gen_new_feats(train, test):
    '''Generate new features, e.g. interest rates and binned features'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])
    # Step 2: the actual feature engineering
    # Compute the annual interest rate of the secondary account
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure'] - data['sub_account_sanction_loan']) / data['sub_account_sanction_loan']
    # Compute the annual interest rate of the main account
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure'] - data['main_account_sanction_loan']) / data['main_account_sanction_loan']
    # Bin some of the features
    # Equal-width binning
    loan_to_asset_ratio_labels = [i for i in range(10)]
    data['loan_to_asset_ratio_bin'] = pd.cut(data["loan_to_asset_ratio"], 10, labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=loan_to_asset_ratio_labels)
    # Custom binning
    amount_cols = [
        'total_monthly_payment',
        'main_account_sanction_loan',
        'main_account_disbursed_loan',
        'sub_account_sanction_loan',
        'sub_account_disbursed_loan',
        'main_account_monthly_payment',
        'sub_account_monthly_payment',
        'total_sanction_loan'
    ]
    amount_labels = [i for i in range(10)]
    for col in amount_cols:
        # The same hand-picked edges are reused for every amount column
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin, labels=amount_labels).astype(int)
    # Step 3: return the training and test sets with the new features
    # (test rows are the ones whose loan_default is missing after the concat)
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]
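To make the difference between equal-width (`pd.cut`) and equal-frequency (`pd.qcut`) binning concrete, here is a small sketch on made-up values with one outlier. The numbers are illustrative only and are not taken from the competition data.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range [1, 100] is split into 2 intervals of equal length,
# so the outlier ends up alone in the upper bin.
equal_width = pd.cut(values, 2, labels=[0, 1])

# Equal-frequency: the bin edges are quantiles, so each bin holds
# roughly the same number of rows.
equal_freq = pd.qcut(values, 2, labels=[0, 1])

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(width_counts)  # [9, 1]
print(freq_counts)   # [5, 5]
```

This is one reason a skewed amount column is often binned by quantiles or by hand-picked edges rather than by equal width.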
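2.2 Encoding: Target Encoding
Target encoding is a way of encoding a feature using the target values.
In binary classification, for feature i, the target encoding of value k is the expected target for category k, E(y | x_i = x_ik).
Suppose the sample has 10 records, 3 of which have the feature Trend equal to Up, and focus on those 3. When k = Up, the expected target is 2/3 ≈ 0.67, so Up is encoded as 0.67.
The author later applies target encoding mainly to the id features.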
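The Trend example can be reproduced in a few lines of pandas. This is only a sketch: the 3 Up rows follow the text (2 positives out of 3), while the Down and Flat rows are invented to fill the table up to 10 records.

```python
import pandas as pd

df = pd.DataFrame({
    'Trend':  ['Up', 'Up', 'Up', 'Down', 'Down', 'Down', 'Down', 'Flat', 'Flat', 'Flat'],
    'target': [1,    1,    0,    0,      0,      1,      0,      1,      0,      1],
})

# E(y | Trend = k): the mean of the target within each category.
encoding = df.groupby('Trend')['target'].mean()

# Replace each category by its mean target.
df['Trend_mean_target'] = df['Trend'].map(encoding)
```

Here `encoding['Up']` is 2/3, which is the value every Up row receives.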
def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target encoding features'''
    # Training set: out-of-fold encoding via cross-validation
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for _, (train_index, val_index) in enumerate(kfold.split(train[encode_cols], train[target_col])):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            # Category means come from the other folds only,
            # so a row is never encoded with its own label
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            # Write straight into the array instead of assigning to a slice of train
            tg_feats[val_index, idx] = df_val[col].map(target_mean_dict).values
    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]
    # Test set: encode with the means from the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)
    return train, test
Honestly, I haven't fully understood this code yet~ I'll jot it down in my notebook and simply pull it out when I need it, hhh
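One way to read the function: each training row is encoded with category means computed on the other folds, so its own label never leaks into its feature. The sketch below shows that idea on a toy frame with two hand-picked folds instead of `StratifiedKFold`; the column name `branch_id` is borrowed from the post, and the data are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'branch_id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'target':    [1,   0,   1,   1,   0,   0,   1,   0],
})

# Two hand-picked folds (row positions), just to keep the arithmetic visible.
folds = [np.array([0, 1, 4, 5]), np.array([2, 3, 6, 7])]

encoded = np.zeros(len(df))
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(df)), val_idx)
    # Category means use only the rows outside the validation fold.
    means = df.iloc[train_idx].groupby('branch_id')['target'].mean()
    encoded[val_idx] = df.iloc[val_idx]['branch_id'].map(means).to_numpy()

df['branch_id_mean_target'] = encoded
print(encoded.tolist())  # [1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.0, 0.0]
```

Rows 0 and 1 have different labels but receive the same encoding, because the value depends only on the other fold. A plain `groupby` mean over all rows would give every `a` row 0.75, a number that already contains the row's own label.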
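2.3 Neighbor fraud feature
In risk control, risky accounts are often registered in batches, so their ids may be consecutive.
Here the author builds a neighbor fraud feature: the mean label of the roughly 10 accounts before and after each account. It acts as a probability that defaulting accounts cluster nearby and, to some extent, reflects how likely the account itself is to default.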
def gen_neighbor_feats(train, test):
    '''Generate the neighbor fraud feature'''
    # save_pkl / load_pkl are the author's helpers; their definitions are not part of this excerpt
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a long time to compute, so it is cached as a pkl file
        # Hoisted out of the loop: these lookups do not change between iterations
        train_ids = set(train.customer_id.values.tolist())
        train_by_id = train.set_index('customer_id')
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            # Note: range(i + 1, i + 10) covers the 9 ids after i, not 10
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))
            # Keep only the neighbors that exist in the training set (i.e. have a label)
            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors if
                                     customer_id_neighbor in train_ids]
            neighbor_default_prob = train_by_id.loc[customer_id_neighbors].loan_default.mean()
            neighbor_default_probs.append(neighbor_default_prob)
        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')
    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')
    return train, test
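The loop above is slow because it visits every id one by one. As a rough sketch of the same idea (not the author's implementation), the neighbor mean can be computed with centered rolling windows over a dense id range. The helper name `neighbor_label_mean` and the symmetric window of k ids on each side are assumptions; the original code looks at 10 ids before and 9 after.

```python
import numpy as np
import pandas as pd


def neighbor_label_mean(ids, labels, k=10):
    """Mean label over ids within +/- k of each id, excluding the id itself.

    Hypothetical helper: ids without a label (e.g. test rows) are skipped.
    """
    s = pd.Series(labels, index=ids, dtype=float)
    dense = s.reindex(range(min(ids), max(ids) + 1))  # NaN where no label exists
    has_label = dense.notna().astype(float)
    filled = dense.fillna(0)
    window = 2 * k + 1
    total = filled.rolling(window, center=True, min_periods=1).sum()
    count = has_label.rolling(window, center=True, min_periods=1).sum()
    # Remove each id's own contribution so its label does not leak.
    total = total - filled
    count = count - has_label
    # Where no labelled neighbor exists, the result is NaN.
    return (total / count.where(count > 0)).rename('neighbor_default_prob')


# Toy data: id 3 has no label, as a test-set id would.
probs = neighbor_label_mean(ids=[0, 1, 2, 4, 5], labels=[1, 0, 1, 0, 0], k=1)
print(probs.tolist())  # [0.0, 1.0, 0.0, 0.5, 0.0, 0.0]
```

Id 3 still gets a value (0.5, from ids 2 and 4), which is how unlabeled test rows can pick up the feature from their labelled neighbors.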
2.4 Feature engineering output
TARGET_ENCODING_FETAS = [
'employment_type',
'branch_id',
'supplier_id',
'manufacturer_id',
'area_id',
'employee_code_id',
'asset_cost_bin'
]
# Feature engineering
logging.info('feature generating...')
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS, target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)
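Post-processing of the features: for example, converting the dtypes of some transformed features and simplifying some rate features, which makes later model training easier and improves the model's robustness.
# List of the final feature names to keep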
SAVE_FEATS = [
'customer_id',
'neighbor_default_prob',
'disbursed_amount',
'asset_cost',
'branch_id',
'supplier_id',
'manufacturer_id',
'area_id',
'employee_code_id',
'credit_score',
'loan_to_asset_ratio',
'year_of_birth',
'age',
'sub_Rate',
'main_Rate',
'loan_to_asset_ratio_bin',
'asset_cost_bin',
'employment_type_mean_target',
'branch_id_mean_target',
'supplier_id_mean_target',
'manufacturer_id_mean_target',
'area_id_mean_target',
'employee_code_id_mean_target',
'asset_cost_bin_mean_target',
'credit_history',
'average_age',
'total_disbursed_loan',
'main_account_disbursed_loan',
'total_sanction_loan',
'main_account_sanction_loan',
'active_to_inactive_act_ratio',
'total_outstanding_loan',
'main_account_outstanding_loan',
'Credit_level',
'outstanding_disburse_ratio',
'total_account_loan_no',
'main_account_tenure',
'main_account_loan_no',
'main_account_monthly_payment',
'total_monthly_payment',
'main_account_active_loan_no',
'main_account_inactive_loan_no',
'sub_account_inactive_loan_no',
'enquirie_no',
'main_account_overdue_no',
'total_overdue_no',
'last_six_month_defaulted_no'
]
# Feature engineering: post-processing
# Simplify features: cap the rate features at 1
for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
    train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
    test[col] = test[col].apply(lambda x: 1 if x > 1 else x)
# Data type conversion
train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)
# Save the datasets with the new features
logging.info('new data saving...')
cols = SAVE_FEATS + ['loan_default',]
train[cols].to_csv('./train_final.csv', index=False)
test[cols].to_csv('./test_final.csv', index=False)
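Model Building
1. Model training with cross-validation
Two gradient boosted tree models are used, lightgbm and xgboost. I won't explain much here; the code below has more or less become "standard", if you know, you know~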
def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # Positional indexing, so this does not depend on the index labels
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': -1,
            'seed': 1024
        }
        # Note: verbose_eval and early_stopping_rounds were removed from lgb.train in LightGBM 4.0;
        # there, pass callbacks=[lgb.early_stopping(20), lgb.log_evaluation(50)] instead
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        verbose_eval=50,
                        early_stopping_rounds=20)
        # Out-of-fold predictions for the validation rows
        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        # Test predictions are averaged over the folds
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds
def train_xgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train xgboost with k-fold split'''
    gbms = []
    # Use the n_fold argument (the original hard-coded 10, which is what the caller passes)
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # Positional indexing, so this does not depend on the index labels
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train.iloc[train_index], y_train.iloc[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)
        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample': 0.9,
            'min_child_weight': 10,
            'colsample_bytree': 0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }
        watchlist = [(dtrain, 'train'), (dvalid, 'test')]
        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)
        # Note: best_iteration is zero-based, so the XGBoost docs use (0, best_iteration + 1);
        # the original range is kept here to match the author's results
        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=(0, gbm.best_iteration))
        test_preds += gbm.predict(dtest, iteration_range=(0, gbm.best_iteration)) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds
def train_xgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train xgboost'''
    # Cap the rate features at 1 (the xgboost branch works on copies of train/test)
    for col in ['sub_Rate', 'main_Rate', 'outstanding_disburse_ratio']:
        train[col] = train[col].apply(lambda x: 1 if x > 1 else x)
        test[col] = test[col].apply(lambda x: 1 if x > 1 else x)
    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb_kfold(X_train, y_train, X_test, n_fold=n_fold)
    # Save the fold models only if no saved file exists yet
    if not os.path.exists('../user_data/gbms_xgb.pkl'):
        save_pkl(gbms_xgb, '../user_data/gbms_xgb.pkl')
    return gbms_xgb, oof_preds_xgb, test_preds_xgb
def train_lgb(train, test, feat_cols, label_col, n_fold=10):
    '''Train lightgbm'''
    X_train = train[feat_cols]
    y_train = train[label_col]
    X_test = test[feat_cols]
    gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb_kfold(X_train, y_train, X_test, n_fold=n_fold)
    # Save the fold models only if no saved file exists yet
    if not os.path.exists('../user_data/gbms_lgb.pkl'):
        save_pkl(gbms_lgb, '../user_data/gbms_lgb.pkl')
    return gbms_lgb, oof_preds_lgb, test_preds_lgb
Run the model training:
# Load the raw datasets
logging.info('data loading...')
train = pd.read_csv('../xfdata/车辆贷款违约预测数据集/train.csv')
test = pd.read_csv('../xfdata/车辆贷款违约预测数据集/test.csv')
# Feature engineering
logging.info('feature generating...')
train, test = gen_new_feats(train, test)
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS, target_col='loan_default', n_fold=10)
train, test = gen_neighbor_feats(train, test)
train['asset_cost_bin'] = train['asset_cost_bin'].astype(int)
test['asset_cost_bin'] = test['asset_cost_bin'].astype(int)
train['loan_to_asset_ratio_bin'] = train['loan_to_asset_ratio_bin'].astype(int)
test['loan_to_asset_ratio_bin'] = test['loan_to_asset_ratio_bin'].astype(int)
train['asset_cost_bin_mean_target'] = train['asset_cost_bin_mean_target'].astype(float)
test['asset_cost_bin_mean_target'] = test['asset_cost_bin_mean_target'].astype(float)
# Model training: xgboost results differ slightly between linux and mac; the saved model files are authoritative
gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb(train.copy(), test.copy(),
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')
gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb(train, test,
                                                    feat_cols=SAVE_FEATS,
                                                    label_col='loan_default')
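2. Choosing the threshold
Because this is a 0-1 binary classification, the mean of the labels can be read roughly as the probability of loan_default = 1.
Then sort the cv predictions and take the value at the quantile (1 - P(loan_default=1)) as the cut-off between predicted positives and negatives.
To make the result more precise, scan the points near that cut-off with a small step and pick the locally optimal probability threshold.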
def gen_thres_new(df_train, oof_preds):
    df_train['oof_preds'] = oof_preds
    # Share of positives, read as the probability of loan_default = 1 in the training set
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)
    # e.g. for 0,1,1,1: mean = 0.75 and 1 - mean = 0.25, so the value at the 25% quantile is 0
    _thresh = []
    # Scan +/- 0.2 around the theoretical threshold in steps of 0.01;
    # the threshold with the highest F1 score is the best one
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append([thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])
    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()  # row with the highest F1
    best_thresh = _thresh[best_id][0]  # the best threshold
    print("Threshold: {}\nTraining-set F1: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh
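The starting point of the search can be checked on a toy example: cutting the scores at the (1 - positive rate) quantile flags the same share of rows as positive as the base rate in the labels. The labels and scores below are made up, and `np.quantile` stands in for the pandas `quantile` call used above.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.40, 0.70, 0.90])

positive_rate = y_true.mean()  # 3 of the 10 labels are positive
start_thres = np.quantile(scores, 1 - positive_rate)

# Everything above the cut-off is predicted positive.
predicted = (scores > start_thres).astype(int)
predicted_rate = predicted.mean()
```

Here 3 of the 10 rows land above the cut-off, matching the 3 positives in `y_true`. The scan in `gen_thres_new` then moves the cut-off around this value to maximize macro F1.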
3. Model blending
Take a weighted sum of the percentile ranks of the xgb and lgb cv predictions, then find the 0-1 probability threshold for the blended model.
xgb_thres = gen_thres_new(train, oof_preds_xgb)
lgb_thres = gen_thres_new(train, oof_preds_lgb)
# Collect the results
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'loan_default': train['loan_default'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb})
# Model blending
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)  # percentile rank of each prediction
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']
# Threshold for the blended model
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])
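A small sketch of why ranks are blended rather than raw scores: the two models can output probabilities on very different scales, and `rank(pct=True)` puts both on a common percentile scale first. The scores below are made up; the 0.31 / 0.69 weights are the ones used in the post.

```python
import pandas as pd

preds = pd.DataFrame({
    'xgb': [0.10, 0.40, 0.35, 0.80],
    'lgb': [0.02, 0.03, 0.05, 0.09],
})

# rank(pct=True) maps each score to its percentile rank in (0, 1].
xgb_rank = preds['xgb'].rank(pct=True)  # 0.25, 0.75, 0.50, 1.00
lgb_rank = preds['lgb'].rank(pct=True)  # 0.25, 0.50, 0.75, 1.00

blend = 0.31 * xgb_rank + 0.69 * lgb_rank
```

The models disagree on the order of rows 1 and 2. Because lgb carries the larger weight, the blend follows lgb and ranks row 2 above row 1.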
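Prediction
Using the threshold found on the blended training-set predictions, split the test-set predictions into 0 and 1 and write out the final submission.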
def gen_submit_file(df_test, test_preds, thres, save_path):
    # Split by the threshold of the final blended model
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit
df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})
df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']
# Produce the result
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')
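Summary
The author's code style is clear and concise, the code reads smoothly, and the reasoning is easy to follow. This kind of engineered code is worth studying: it is easy to extend and easy to debug.
From the competition's angle, thinking about the business led to a "neighbor fraud feature" built on id concentration; for blending, the models are weighted by the percentile rank of their predictions. Both tricks can be reused directly~ (they are also the score-boosting points the author mentioned)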
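The 2 questions below probably puzzle many readers just as they puzzled me, so I took screenshots straight from Zhihu. (The screenshots are not reproduced here.)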
Source code: https://github.com/WangliLin/…
Follow my WeChat official account: Distinct 数说