Machine Learning Introductory Series (7): Classification with LightGBM on a League of Legends Dataset

Project link: https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc

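LightGBM is a scalable machine learning system released by Microsoft in 2017. It is an open-source project under Microsoft's DMTK (Distributed Machine Learning Toolkit) and is a distributed gradient boosting framework based on the GBDT (gradient boosting decision tree) algorithm. To shorten model training time, its design focuses on reducing the memory and compute that the data consumes and on lowering the communication cost of multi-machine parallel training.

LightGBM can be seen as an upgraded version of XGBoost: it reaches accuracy close to XGBoost's while training faster and using less memory. As the "Light" in its name suggests, it runs nimbly on large datasets, and soon after its release it became a favorite tool for topping leaderboards in data competitions.

LightGBM implements the GBDT algorithm at its core and adds a series of new features:

Histogram-based optimization, which makes the data cheaper to store and faster to compute on, and improves robustness and model stability.
A leaf-wise growth algorithm with a depth limit. It abandons the level-wise tree growth strategy used by most GBDT tools in favor of leaf-wise growth with a depth limit, which can reduce error and yield better accuracy.
Gradient-based one-side sampling (GOSS), which discards most samples with small gradients and computes information gain on the remaining samples only. It balances reducing the amount of data against preserving accuracy.
Exclusive feature bundling (EFB). High-dimensional data is often sparse, and this sparsity suggests a lossless way to reduce the number of features. The bundled features are usually mutually exclusive (they are never nonzero at the same time, as with one-hot encoding), so bundling two of them loses no information.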
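To make the one-side sampling idea more concrete, here is a minimal NumPy sketch. It assumes the usual description of GOSS: keep the fraction of samples with the largest absolute gradients, draw a random fraction of the rest, and up-weight the drawn small-gradient samples by (1 - top_rate) / other_rate. The function name goss_sample and the random gradients are illustrative assumptions, not LightGBM's internal code.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) picked by a GOSS-style sampling rule (illustrative)."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)

    # Sort samples by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]          # large-gradient samples are always kept
    rest_idx = order[n_top:]

    # Randomly keep a fraction of the small-gradient samples.
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows so that their total
    # contribution approximates that of all small-gradient rows.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights

# Made-up gradients, only to exercise the function.
grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[0], w[-1])
```

With these settings only 300 of the 1000 rows take part in the gain computation, which is where the speed-up comes from.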

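LightGBM is an ensemble model built on CART trees: it chains multiple decision trees so that they make the decision together.

How are the trees chained? LightGBM chains them by iteratively predicting the error. As a simple example, suppose we need to predict the price of a car worth 3000 yuan. We train decision tree 1 and it predicts 2600 yuan, leaving an error of 400 yuan, so the training target of decision tree 2 is 400 yuan. Tree 2 predicts 350 yuan, and the remaining 50 yuan of error is handed to the third tree, and so on. Each tree estimates the error left by all the trees before it, and the sum of all trees' predictions is the final prediction.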
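The arithmetic of the car example can be traced in a few lines of Python. This is only a bookkeeping sketch: the two tree outputs are the numbers from the example, not the result of training anything.

```python
target = 3000.0
tree_outputs = [2600.0, 350.0]   # predictions of tree 1 and tree 2 in the example

prediction = 0.0
for i, out in enumerate(tree_outputs, start=1):
    prediction += out                  # the ensemble prediction is the running sum
    residual = target - prediction     # what the next tree would be trained on
    print(f"after tree {i}: prediction={prediction:.0f}, residual={residual:.0f}")
```

After the second tree the running prediction is 2950 and the remaining 50 is the target for a third tree.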

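The base model of LightGBM is the CART regression tree, which has two characteristics: (1) a CART tree is a binary tree; (2) a regression tree fits a continuous value.

A LightGBM model can be written as follows. Let $f_t(x)$ denote the sum of the first $t$ trees and $h_t(x)$ the $t$-th decision tree. With $T$ trees in total, the model is:

$f_{T}(x)=\sum_{t=1}^{T} h_{t}(x)$

Because the model is built recursively, the model at step $t$ is obtained from the model at step $t-1$:

$f_{t}(x)=f_{t-1}(x)+h_{t}(x)$

The tree $h_t(x)$ added at each step fits the error (residual) left by the sum of the previous trees:

$r_{t, i}=y_{i}-f_{t-1}\left(x_{i}\right)$

At each step we only need to fit a CART tree whose target output is $r_{t,i}$ and add it to $f_{t-1}(x)$.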
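The recursion above can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: squared-error loss, a single feature, depth-1 regression trees (stumps) as $h_t$, and no learning rate. The helper names fit_stump and boost and the synthetic data are made up for the example; LightGBM itself is considerably more elaborate.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression tree to residuals r on a 1-D feature x."""
    best = None
    for thr in np.unique(x)[:-1]:            # every threshold that leaves both sides non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, left_value, right_value = best
    return lambda q: np.where(q <= thr, left_value, right_value)

def boost(x, y, n_trees=20):
    pred = np.zeros_like(y, dtype=float)     # f_0(x) = 0
    trees, errors = [], []
    for _ in range(n_trees):
        residual = y - pred                  # r_{t,i} = y_i - f_{t-1}(x_i)
        h = fit_stump(x, residual)           # h_t is fitted to the residuals
        pred = pred + h(x)                   # f_t = f_{t-1} + h_t
        trees.append(h)
        errors.append(float(np.mean((y - pred) ** 2)))
    return trees, pred, errors

# Synthetic data, only to show the training error shrinking tree by tree.
x = np.linspace(0, 10, 50)
y = np.sin(x) * 1000 + 2000
trees, pred, errors = boost(x, y)
print(round(errors[0], 1), round(errors[-1], 1))
```

The final prediction is the sum of all the stumps' outputs, and the mean squared error on the training data never increases from one tree to the next.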

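LightGBM is used very widely in machine learning and data mining. According to statistics, from 2016 to 2019 LightGBM models placed in the top three of data competitions on Kaggle more than thirty times, including well-known competitions such as CIKM2017 AnalytiCup and IEEE Fraud Detection. These competitions come from real business problems across many industries, and the results show that LightGBM scales well and performs very well on many different kinds of problems.

LightGBM has also been applied successfully to many problems in industry and academia, for example financial risk control, purchase behavior recognition, traffic flow prediction, environmental sound classification, gene classification, and biological composition analysis. Although domain-specific data analysis and feature engineering also play an important role in these solutions, the consistent choice of LightGBM by learners and practitioners shows the influence and importance of this package.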

Learn the parameters of LightGBM and the related background
Master calling LightGBM from Python and apply it to the League of Legends win/loss prediction dataset

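Part1 LightGBM classification practice on the League of Legends dataset

Step1: Import libraries
Step2: Read / load the data
Step3: Take a quick look at the data
Step4: Visual description
Step5: Train and predict with LightGBM
Step6: Select features with LightGBM
Step7: Get better results by tuning parameters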

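At the very start of the practice we import some basic libraries: numpy (the fundamental Python package for scientific computing), pandas (a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and processing), and matplotlib and seaborn for plotting.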

# Download the dataset we need
!wget https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/8LightGBM/high_diamond_ranked_10min.csv

--2023-03-22 19:33:36--  https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/8LightGBM/high_diamond_ranked_10min.csv
Resolving tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)... 49.7.22.39
Connecting to tianchi-media.oss-cn-beijing.aliyuncs.com (tianchi-media.oss-cn-beijing.aliyuncs.com)|49.7.22.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1446502 (1.4M)
Saving to: 'high_diamond_ranked_10min.csv'
high_diamond_ranked 100%[===================>]   1.38M  --.-KB/s    in 0.04s

2023-03-22 19:33:36 (38.3 MB/s) - 'high_diamond_ranked_10min.csv' saved [1446502/1446502]

Step1: Import libraries

## Basic libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

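For this hands-on LightGBM walkthrough we use a League of Legends dataset. League of Legends is a MOBA online game released in 2009 by the American studio Riot Games. In each match a blue team and a red team fight on the same map; the goal is to destroy the enemy team's defensive towers and then their Nexus to win the match.

The dataset contains 9879 ranked matches at Diamond tier and above on the Korean server. It records the game state at the ten-minute mark, including kills, deaths, gold, experience, levels, and so on. The column blueWins is the label and indicates whether the blue team won the match.

The features are described below:

Feature name	Meaning	Value type
WardsPlaced	Number of wards placed	integer
WardsDestroyed	Number of wards destroyed	integer
FirstBlood	Whether the team got first blood	integer
Kills	Number of champion kills	integer
Deaths	Number of deaths	integer
Assists	Number of assists	integer
EliteMonsters	Number of elite monsters killed	integer
Dragons	Number of dragons (epic monsters) killed	integer
Heralds	Number of Rift Heralds killed	integer
TowersDestroyed	Number of towers destroyed	integer
TotalGold	Total gold	integer
AvgLevel	Average champion level	float
TotalExperience	Total champion experience	integer
TotalMinionsKilled	Number of minions killed	integer
TotalJungleMinionsKilled	Number of jungle monsters killed	integer
GoldDiff	Gold difference	integer
ExperienceDiff	Experience difference	integer
CSPerMin	Minions killed per minute	float
GoldPerMin	Gold per minute	float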

Step2: Read / load the data

## Use pandas' read_csv function to read the file into a DataFrame

df = pd.read_csv('./high_diamond_ranked_10min.csv')
y = df.blueWins

Step3: Take a quick look at the data

## Use .info() to see an overview of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   gameId                        9879 non-null   int64  
 1   blueWins                      9879 non-null   int64  
 2   blueWardsPlaced               9879 non-null   int64  
 3   blueWardsDestroyed            9879 non-null   int64  
 4   blueFirstBlood                9879 non-null   int64  
 5   blueKills                     9879 non-null   int64  
 6   blueDeaths                    9879 non-null   int64  
 7   blueAssists                   9879 non-null   int64  
 8   blueEliteMonsters             9879 non-null   int64  
 9   blueDragons                   9879 non-null   int64  
 10  blueHeralds                   9879 non-null   int64  
 11  blueTowersDestroyed           9879 non-null   int64  
 12  blueTotalGold                 9879 non-null   int64  
 13  blueAvgLevel                  9879 non-null   float64
 14  blueTotalExperience           9879 non-null   int64  
 15  blueTotalMinionsKilled        9879 non-null   int64  
 16  blueTotalJungleMinionsKilled  9879 non-null   int64  
 17  blueGoldDiff                  9879 non-null   int64  
 18  blueExperienceDiff            9879 non-null   int64  
 19  blueCSPerMin                  9879 non-null   float64
 20  blueGoldPerMin                9879 non-null   float64
 21  redWardsPlaced                9879 non-null   int64  
 22  redWardsDestroyed             9879 non-null   int64  
 23  redFirstBlood                 9879 non-null   int64  
 24  redKills                      9879 non-null   int64  
 25  redDeaths                     9879 non-null   int64  
 26  redAssists                    9879 non-null   int64  
 27  redEliteMonsters              9879 non-null   int64  
 28  redDragons                    9879 non-null   int64  
 29  redHeralds                    9879 non-null   int64  
 30  redTowersDestroyed            9879 non-null   int64  
 31  redTotalGold                  9879 non-null   int64  
 32  redAvgLevel                   9879 non-null   float64
 33  redTotalExperience            9879 non-null   int64  
 34  redTotalMinionsKilled         9879 non-null   int64  
 35  redTotalJungleMinionsKilled   9879 non-null   int64  
 36  redGoldDiff                   9879 non-null   int64  
 37  redExperienceDiff             9879 non-null   int64  
 38  redCSPerMin                   9879 non-null   float64
 39  redGoldPerMin                 9879 non-null   float64
dtypes: float64(6), int64(34)
memory usage: 3.0 MB

## For a quick look at the data, use .head() for the first rows and .tail() for the last rows
df.head()


	gameId	blueWardsPlaced	blueWardsDestroyed	blueFirstBlood	blueKills	blueDeaths	blueAssists	blueEliteMonsters	blueDragons	…	redTowersDestroyed	redTotalGold	redAvgLevel	redTotalExperience	redTotalMinionsKilled	redTotalJungleMinionsKilled	redGoldDiff	redExperienceDiff	redCSPerMin	redGoldPerMin
0	4519157822	28	2	1	9	6	11	0	0	…	0	16567	6.8	17047	197	55	-643	8	19.7	1656.7
1	4523371949	12	1	0	5	5	5	0	0	…	1	17620	6.8	17438	240	52	2908	1173	24.0	1762.0
2	4521474530	15	0	0	7	11	4	1	1	…	0	17285	6.8	17254	203	28	1172	1033	20.3	1728.5
3	4524384067	43	1	0	4	5	5	1	0	…	0	16478	7.0	17961	235	47	1321	7	23.5	1647.8
4	4436033771	75	4	0	6	6	6	0	0	…	0	17404	7.0	18313	225	67	1004	-230	22.5	1740.4

5 rows × 40 columns

df.tail()


	gameId	blueWins	blueWardsPlaced	blueWardsDestroyed	blueFirstBlood	blueKills	blueDeaths	blueAssists	blueEliteMonsters	blueDragons	…	redTotalGold	redAvgLevel	redTotalExperience	redTotalMinionsKilled	redTotalJungleMinionsKilled	redGoldDiff	redExperienceDiff	redCSPerMin	redGoldPerMin
9874	4527873286	1	17	2	1	7	4	5	1	1	…	15246	6.8	16498	229	34	-2519	-2469	22.9	1524.6
9875	4527797466	1	54	0	0	6	4	8	1	1	…	15456	7.0	18367	206	56	-782	-888	20.6	1545.6
9876	4527713716	0	23	1	0	6	7	5	0	0	…	18319	7.4	19909	261	60	2416	1877	26.1	1831.9
9877	4527628313	0	14	4	1	2	3	3	1	1	…	15298	7.2	18314	247	40	839	1085	24.7	1529.8
9878	4523772935	1	18	0	1	6	6	5	0	0	…	15339	6.8	17379	201	46	-927	58	20.1	1533.9

5 rows × 40 columns

## Extract the label and use value_counts to count the samples of each class
y = df.blueWins
y.value_counts()

0    4949
1    4930
Name: blueWins, dtype: int64

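The numbers of positive and negative labels are almost equal, so there is no class imbalance problem.

## Select the feature columns
drop_cols = ['gameId','blueWins']
x = df.drop(drop_cols, axis=1)

## Compute summary statistics for the features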
x.describe()


	blueWardsPlaced	blueWardsDestroyed	blueFirstBlood	blueKills	blueDeaths	blueAssists	blueEliteMonsters	blueDragons	blueHeralds	blueTowersDestroyed	…	redTowersDestroyed	redTotalGold	redAvgLevel	redTotalExperience	redTotalMinionsKilled	redTotalJungleMinionsKilled	redGoldDiff	redExperienceDiff	redCSPerMin	redGoldPerMin
count	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	…	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000	9879.000000
mean	22.288288	2.824881	0.504808	6.183925	6.137666	6.645106	0.549954	0.361980	0.187974	0.051422	…	0.043021	16489.041401	6.925316	17961.730438	217.349226	51.313088	-14.414111	33.620306	21.734923	1648.904140
std	18.019177	2.174998	0.500002	3.011028	2.933818	4.064520	0.625527	0.480597	0.390712	0.244369	…	0.216900	1490.888406	0.305311	1198.583912	21.911668	10.027885	2453.349179	1920.370438	2.191167	149.088841
min	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	11212.000000	4.800000	10465.000000	107.000000	4.000000	-11467.000000	-8348.000000	10.700000	1121.200000
25%	14.000000	1.000000	0.000000	4.000000	4.000000	4.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	15427.500000	6.800000	17209.500000	203.000000	44.000000	-1596.000000	-1212.000000	20.300000	1542.750000
50%	16.000000	3.000000	1.000000	6.000000	6.000000	6.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	16378.000000	7.000000	17974.000000	218.000000	51.000000	-14.000000	28.000000	21.800000	1637.800000
75%	20.000000	4.000000	1.000000	8.000000	8.000000	9.000000	1.000000	1.000000	0.000000	0.000000	…	0.000000	17418.500000	7.200000	18764.500000	233.000000	57.000000	1585.500000	1290.500000	23.300000	1741.850000
max	250.000000	27.000000	1.000000	22.000000	22.000000	29.000000	2.000000	1.000000	1.000000	4.000000	…	2.000000	22732.000000	8.200000	22269.000000	289.000000	92.000000	10830.000000	9333.000000	28.900000	2273.200000

8 rows × 38 columns

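We find that the numbers of wards placed and destroyed vary widely between matches, and there are even outliers with 250 wards placed in the first ten minutes.
We find that EliteMonsters equals Dragons + Heralds.
We find that variables such as TotalGold differ little across most matches.
We find that the gold difference and experience difference of the two teams are negatives of each other.
We find that the red team and the blue team each take first blood about 50% of the time.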
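Observations like these are easy to confirm programmatically. The sketch below runs the checks on a tiny made-up sample that only borrows the dataset's column names; on the real data, the same expressions could be applied to df instead of sample.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    'blueEliteMonsters': [0, 1, 2, 1],
    'blueDragons':       [0, 1, 1, 0],
    'blueHeralds':       [0, 0, 1, 1],
    'blueGoldDiff':      [643, -2908, -1172, 1321],
    'redGoldDiff':       [-643, 2908, 1172, -1321],
    'blueFirstBlood':    [1, 0, 0, 1],
    'redFirstBlood':     [0, 1, 1, 0],
})

# EliteMonsters should equal Dragons + Heralds in every row.
elite_ok = (sample['blueEliteMonsters'] == sample['blueDragons'] + sample['blueHeralds']).all()
# The two teams' gold differences should be negatives of each other.
gold_ok = (sample['blueGoldDiff'] == -sample['redGoldDiff']).all()
# Exactly one team takes first blood.
blood_ok = (sample['blueFirstBlood'] + sample['redFirstBlood'] == 1).all()
print(elite_ok, gold_ok, blood_ok)
```

When a check like this holds for every row, one of the columns involved carries no extra information and is a candidate for removal.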

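## Based on the description above we can remove some duplicated variables. For example, once we know whether the blue team got first blood, we know whether the red team did, so the redundant red-team columns can be dropped.
drop_cols = ['redFirstBlood','redKills','redDeaths'
             ,'redGoldDiff','redExperienceDiff', 'blueCSPerMin',
            'blueGoldPerMin','redCSPerMin','redGoldPerMin']
x.drop(drop_cols, axis=1, inplace=True)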

Step4: Visual description

data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 0:9]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

fig, ax = plt.subplots(1,2,figsize=(15,5))

# Draw the violin plot for the first nine features
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[0], palette='Blues')
fig.autofmt_xdate(rotation=45)

data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 9:18]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

# Draw the violin plot for the next nine features
sns.violinplot(x='Features', y='Values', hue='blueWins', 
               data=data, split=True, inner='quart', ax=ax[1], palette='Blues')
fig.autofmt_xdate(rotation=45)

plt.show()

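A violin plot shows the distribution and probability density of several groups of data. It combines features of the box plot and the density plot and is mainly used to show the shape of a distribution.

From the plots we can see that:

More champion kills make a win more likely, and more deaths make a loss more likely (compare the left and right halves for blueKills and blueDeaths).
Assists and champion kills produce similar shapes, which suggests they affect the outcome to a similar degree.
Taking first blood is positively correlated with winning, but the correlation is weaker than for champion kills.
The gold difference and experience difference appear to have a smaller effect on the outcome.
The number of jungle monsters killed has little effect on the outcome.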

plt.figure(figsize=(18,14))
sns.heatmap(round(x.corr(),2), cmap='Blues', annot=True)
plt.show()

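We also draw a heatmap of the correlations between features; a darker color means a stronger correlation. We remove redundant features that are strongly correlated with others.

# Remove redundant features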
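Reading a large heatmap by eye is error-prone, so it can help to list the strongly correlated pairs directly. The following sketch is one way to do it; the helper name correlated_pairs, the 0.9 threshold, and the synthetic demo frame are assumptions made for illustration and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):      # upper triangle only, skip the diagonal
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 2)))
    return pairs

# Synthetic demo: goldPerMin is an exact rescaling of totalGold.
rng = np.random.default_rng(0)
gold = rng.normal(16500, 1500, size=200)
demo = pd.DataFrame({
    'totalGold': gold,
    'goldPerMin': gold / 10,
    'wardsPlaced': rng.integers(5, 40, size=200),
})
print(correlated_pairs(demo))
```

On the real features, a call such as correlated_pairs(x) would give a shortlist of candidates to drop, from which one column of each pair is kept.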
drop_cols = ['redAvgLevel','blueAvgLevel']
x.drop(drop_cols, axis=1, inplace=True)

sns.set(style='whitegrid', palette='muted')

# Construct two new features
x['wardsPlacedDiff'] = x['blueWardsPlaced'] - x['redWardsPlaced']
x['wardsDestroyedDiff'] = x['blueWardsDestroyed'] - x['redWardsDestroyed']

data = x[['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff','wardsDestroyedDiff']].sample(1000)
data_std = (data - data.mean()) / data.std()
# join='inner' keeps only the labels of the 1000 sampled rows
data = pd.concat([y, data_std], axis=1, join='inner')
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

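We draw a swarm (scatter) plot of the ward counts and find no clear pattern between ward counts and the outcome. A likely reason is that at Diamond tier and above, where to place wards and where to clear them is standard practice, so the ward counts in the first ten minutes have little effect on the game. We therefore drop these features for now.

## Remove the ward-related features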
drop_cols = ['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff',
            'wardsDestroyedDiff','redWardsPlaced','redWardsDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

x['killsDiff'] = x['blueKills'] - x['blueDeaths']
x['assistsDiff'] = x['blueAssists'] - x['redAssists']

x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].hist(figsize=(12,10), bins=20)
plt.show()

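We find that the distributions of kills, deaths, and assists differ little from one another. The distributions of kills minus deaths and of blue assists minus red assists, however, differ considerably from the original ones, so we construct these two new features.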


data = x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].sample(1000)
data_std = (data - data.mean()) / data.std()
# join='inner' keeps only the labels of the 1000 sampled rows
data = pd.concat([y, data_std], axis=1, join='inner')
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

From the plot above we can see that kills, deaths, and assists, as well as the features we constructed, all separate the data fairly well.

data = pd.concat([y, x], axis=1).sample(500)

sns.pairplot(data, vars=['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists'], 
             hue='blueWins')

plt.show()

Some pairs of features, taken together, also separate the data better.

x['dragonsDiff'] = x['blueDragons'] - x['redDragons']
x['heraldsDiff'] = x['blueHeralds'] - x['redHeralds']
x['eliteDiff'] = x['blueEliteMonsters'] - x['redEliteMonsters']

data = pd.concat([y, x], axis=1)

eliteGroup = data.groupby(['eliteDiff'])['blueWins'].mean()
dragonGroup = data.groupby(['dragonsDiff'])['blueWins'].mean()
heraldGroup = data.groupby(['heraldsDiff'])['blueWins'].mean()

fig, ax = plt.subplots(1,3, figsize=(15,4))

eliteGroup.plot(kind='bar', ax=ax[0])
dragonGroup.plot(kind='bar', ax=ax[1])
heraldGroup.plot(kind='bar', ax=ax[2])

print(eliteGroup)
print(dragonGroup)
print(heraldGroup)

plt.show()

eliteDiff
-2    0.286301
-1    0.368772
 0    0.500683
 1    0.632093
 2    0.735211
Name: blueWins, dtype: float64
dragonsDiff
-1    0.374173
 0    0.500000
 1    0.640940
Name: blueWins, dtype: float64
heraldsDiff
-1    0.387729
 0    0.498680
 1    0.595046
Name: blueWins, dtype: float64

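We construct the differences between the two teams in dragons taken, Rift Heralds taken, and elite monsters killed. Early in the game, taking the dragon is associated with a higher win rate than taking the Rift Herald. The number of elite monsters taken is also strongly correlated with the win rate.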

x['towerDiff'] = x['blueTowersDestroyed'] - x['redTowersDestroyed']

data = pd.concat([y, x], axis=1)

towerGroup = data.groupby(['towerDiff'])['blueWins']
print(towerGroup.count())
print(towerGroup.mean())

fig, ax = plt.subplots(1,2,figsize=(15,5))

towerGroup.mean().plot(kind='line', ax=ax[0])
ax[0].set_title('Proportion of Blue Wins')
ax[0].set_ylabel('Proportion')

towerGroup.count().plot(kind='line', ax=ax[1])
ax[1].set_title('Count of Towers Destroyed')
ax[1].set_ylabel('Count')

towerDiff
-2      27
-1     347
 0    9064
 1     406
 2      28
 3       6
 4       1
Name: blueWins, dtype: int64
towerDiff
-2    0.185185
-1    0.216138
 0    0.498124
 1    0.741379
 2    0.964286
 3    1.000000
 4    1.000000
Name: blueWins, dtype: float64





Text(0,0.5,'Count')

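Destroying towers is the core of League of Legends, so the number of towers destroyed may be closely related to the outcome. The plots show that although the probability of taking the first tower within the first ten minutes is low, once a team does take it, its win rate rises sharply.

Step5: Train and predict with LightGBM

## To evaluate the model properly, split the data into a training set and a test set, train the model on the training set, and validate it on the test set.
from sklearn.model_selection import train_test_split

## Use the label and feature columns prepared above
data_target_part = y
data_features_part = x

## Hold out 20% of the data as the test set (80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part, test_size = 0.2, random_state = 2020)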

## Import the LightGBM model
from lightgbm.sklearn import LGBMClassifier
## Define the LightGBM model
clf = LGBMClassifier()
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)

LGBMClassifier()

## Predict on the training set and the test set with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics

## Evaluate the model with accuracy (the share of predictions that are correct)
print('The accuracy of LightGBM on the training set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of LightGBM on the test set is:',metrics.accuracy_score(y_test,test_predict))

## Inspect the confusion matrix (counts of each combination of predicted and true labels)
## With this argument order, rows are predicted labels and columns are true labels
confusion_matrix_result = metrics.confusion_matrix(test_predict,y_test)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()

The accuracy of LightGBM on the training set is: 0.8447425028470201
The accuracy of LightGBM on the test set is: 0.722165991902834
The confusion matrix result:
 [[714 300]
 [249 713]]

A total of 714 + 713 samples are predicted correctly and 300 + 249 samples are predicted incorrectly.
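The counts of correct and incorrect predictions, and the accuracy printed above, can be recomputed directly from the confusion matrix. The sketch below assumes the orientation produced by the call in this article (rows are predicted labels, columns are true labels); precision and recall for the "blue wins" class depend on that assumption, while accuracy does not.

```python
import numpy as np

# Rows: predicted class, columns: true class (argument order used in this article).
cm = np.array([[714, 300],
               [249, 713]])

correct = int(np.trace(cm))     # diagonal entries are the correct predictions
total = int(cm.sum())
accuracy = correct / total

precision_1 = cm[1, 1] / cm[1, :].sum()   # of the predicted blue wins, how many were real
recall_1 = cm[1, 1] / cm[:, 1].sum()      # of the real blue wins, how many were found
print(correct, total - correct, round(accuracy, 4))
print(round(precision_1, 4), round(recall_1, 4))
```

The diagonal sums to 1427 of 1976 test samples, which matches the test accuracy of about 0.7222 reported above.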

Step6: Select features with LightGBM

Feature selection with LightGBM is an embedded method. In LightGBM the attribute feature_importances_ gives the importance of each feature.

sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)

<matplotlib.axes._subplots.AxesSubplot at 0x7fcb1048e350>

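Features such as the gold difference, assists, and kills and deaths play a large role, whereas the tower counts have little influence on the model (the ward features were already removed in Step4).

In addition, LightGBM offers the following importance types for evaluating features.

gain: the total gain of the splits that use the feature
split: the number of times the feature is used to split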
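The difference between the two importance types can be shown with a small bookkeeping sketch. The split records below are hypothetical numbers invented for the example (the feature names come from this article); the code only illustrates how the two totals are accumulated, not how LightGBM stores its trees.

```python
from collections import defaultdict

# Hypothetical split records: (feature used by the split, gain of that split).
splits = [
    ('blueGoldDiff', 120.0),
    ('blueGoldDiff', 80.0),
    ('killsDiff', 15.0),
    ('killsDiff', 10.0),
    ('killsDiff', 5.0),
    ('towerDiff', 2.0),
]

split_importance = defaultdict(int)
gain_importance = defaultdict(float)
for feature, gain in splits:
    split_importance[feature] += 1      # 'split': how often the feature is used
    gain_importance[feature] += gain    # 'gain': total gain of those splits

print(dict(split_importance))
print(dict(gain_importance))
```

In this toy case killsDiff ranks first by split count but blueGoldDiff ranks first by gain, which is why the two plots can order the features differently.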

from sklearn.metrics import accuracy_score
from lightgbm import plot_importance

def estimate(model,data):

    #sns.barplot(data.columns,model.feature_importances_)
    ax1=plot_importance(model,importance_type="gain")
    ax1.set_title('gain')
    ax2=plot_importance(model, importance_type="split")
    ax2.set_title('split')
    plt.show()
def classes(data,label,test):
    model=LGBMClassifier()
    model.fit(data,label)
    ans=model.predict(test)
    estimate(model, data)
    return ans
 
ans=classes(x_train,y_train,x_test)
pre=accuracy_score(y_test, ans)
print('acc=',pre)




acc= 0.722165991902834

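These plots likewise help us understand the other important features better.

Step7: Get better results by tuning parameters

The parameters with a large effect on a LightGBM model include, but are not limited to, the following:

learning_rate: sometimes called eta; the default is 0.1. It is the step size of each boosting iteration and is important. If it is too large, accuracy suffers; if it is too small, training is slow.
num_leaves: the default is 31. It controls the maximum number of leaves in each tree.
feature_fraction: the default is 1. We usually set it to about 0.8. It controls the fraction of columns randomly sampled for each tree (each column is a feature).
max_depth: the default is -1 (no limit); values between 3 and 10 are common. It is the maximum depth of a tree and is used to control overfitting. The larger max_depth is, the more specific the patterns the model learns.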

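Methods for tuning model parameters include greedy search, grid search, and Bayesian optimization. Here we use grid search. Its basic idea is exhaustive search: loop over all candidate parameter combinations, try every one, and take the best-performing combination as the final result.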
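The exhaustive-search idea can be sketched with the standard library alone. The grid values mirror the ones used in this article, but toy_score is a made-up stand-in for a cross-validated accuracy, so the "best" combination it returns says nothing about the real model.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.1, 0.3, 0.6],
    'feature_fraction': [0.5, 0.8, 1],
    'num_leaves': [16, 32, 64],
    'max_depth': [-1, 3, 5, 8],
}

def toy_score(params):
    """Stand-in for a cross-validated accuracy; NOT a real model evaluation."""
    return 1.0 - abs(params['learning_rate'] - 0.1) - abs(params['num_leaves'] - 16) / 1000

# Enumerate every combination of the candidate values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Keep the combination with the highest score.
best = max(candidates, key=toy_score)
print(len(candidates), best['learning_rate'], best['num_leaves'])
```

The grid has 3 x 3 x 3 x 4 = 108 combinations, so a 3-fold cross-validated search fits 324 models. This is why grid search becomes expensive quickly as parameters or candidate values are added.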

## Import the grid search class from sklearn
from sklearn.model_selection import GridSearchCV

## Define the candidate values for each parameter
learning_rate = [0.1, 0.3, 0.6]
feature_fraction = [0.5, 0.8, 1]
num_leaves = [16, 32, 64]
max_depth = [-1,3,5,8]

parameters = { 'learning_rate': learning_rate,
              'feature_fraction':feature_fraction,
              'num_leaves': num_leaves,
              'max_depth': max_depth}
model = LGBMClassifier(n_estimators = 50)

## Run the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy',verbose=3, n_jobs=-1)
clf = clf.fit(x_train, y_train)

[CV 1/3] END feature_fraction=1, learning_rate=0.6, max_depth=8, num_leaves=64;, score=0.672 total time=   0.1s
[CV 3/3] END feature_fraction=1, learning_rate=0.6, max_depth=8, num_leaves=64;, score=0.685 total time=   0.1s
[LightGBM] [Warning] feature_fraction is set=1, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=1

## The best parameters found by the grid search are

clf.best_params_

{'feature_fraction': 1, 'learning_rate': 0.1, 'max_depth': 3, 'num_leaves': 16}

## Predict on the training set and the test set with the tuned model

## Define the LightGBM model with explicit parameters
## Note: feature_fraction is 0.8 here, although the grid search above selected 1
clf = LGBMClassifier(feature_fraction = 0.8,
                    learning_rate = 0.1,
                    max_depth= 3,
                    num_leaves = 16)
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Evaluate the model with accuracy (the share of predictions that are correct)
print('The accuracy of LightGBM on the training set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of LightGBM on the test set is:',metrics.accuracy_score(y_test,test_predict))

## Inspect the confusion matrix (counts of each combination of predicted and true labels)
## With this argument order, rows are predicted labels and columns are true labels
confusion_matrix_result = metrics.confusion_matrix(test_predict,y_test)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()

The accuracy of LightGBM on the training set is: 0.7440212577502214
The accuracy of LightGBM on the test set is: 0.7317813765182186
The confusion matrix result:
 [[722 289]
 [241 724]]

Previously there were 300 + 249 = 549 errors; now there are 289 + 241 = 530, which raises the test accuracy from about 72.2% to about 73.2%.

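Basic parameter tuning:

num_leaves: the main parameter controlling the complexity of the tree model. We usually keep num_leaves below 2^(max_depth) to prevent overfitting. Because LightGBM grows trees leaf-wise, unlike the depth-wise growth of XGBoost, num_leaves has a larger effect than depth.
min_data_in_leaf: a very important parameter for handling overfitting. Its value depends on the number of training samples and on num_leaves. A larger value avoids growing an overly deep tree but may cause underfitting. In practice, a few hundred or a few thousand is enough for large datasets.
max_depth: the depth of the tree. The notion of depth matters little in leaf-wise trees, because there is no reasonable mapping from leaves to depth.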
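The guideline about num_leaves and 2^(max_depth) can be checked for the grid values used in this article. The sketch relies on one assumption stated in the code: a binary tree limited to depth d cannot have more than 2**d leaves. The helper name max_leaves_for_depth is invented for the example.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth d has at most 2**d leaves; a negative depth means no limit."""
    return None if max_depth < 0 else 2 ** max_depth

rows = []
for max_depth in [-1, 3, 5, 8]:
    for num_leaves in [16, 32, 64]:
        cap = max_leaves_for_depth(max_depth)
        # The smaller of the two limits is the one that actually binds.
        effective = num_leaves if cap is None else min(num_leaves, cap)
        rows.append((max_depth, num_leaves, effective))
        print(max_depth, num_leaves, effective)
```

Under this assumption, a setting such as max_depth=3 with num_leaves=16 is bounded by the depth limit (at most 8 leaves), so raising num_leaves further would not add capacity there.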

To speed up training:

Use bagging by setting bagging_fraction and bagging_freq.
Use feature subsampling by setting feature_fraction.
Choose a smaller max_bin.
Use save_binary to speed up data loading in future training runs.

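For better accuracy:

Use a larger max_bin (training may become slower).
Use a smaller learning_rate with a larger num_iterations.
Use a larger num_leaves (may cause overfitting).
Use more training data.
Try dart mode.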

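To deal with overfitting:

Use a smaller max_bin.
Use a smaller num_leaves.
Use min_data_in_leaf and min_sum_hessian_in_leaf.
Use bagging by setting bagging_fraction and bagging_freq.
Use feature subsampling by setting feature_fraction.
Use more training data.
Use lambda_l1, lambda_l2, and min_gain_to_split for regularization.
Try max_depth to avoid growing overly deep trees.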
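As a rough illustration, the overfitting-related parameters listed above could be collected into one dictionary. Every value below is an assumption chosen only to show the direction of each setting (smaller, larger, enabled); none of them is a tuned result for this dataset.

```python
# Illustrative values only; they are assumptions, not tuned results.
overfit_params = {
    'max_bin': 63,                    # smaller number of histogram bins
    'num_leaves': 15,                 # fewer leaves per tree
    'min_data_in_leaf': 100,          # require more samples in each leaf
    'min_sum_hessian_in_leaf': 1e-2,
    'bagging_fraction': 0.8,          # row subsampling ...
    'bagging_freq': 5,                # ... re-drawn every 5 iterations
    'feature_fraction': 0.8,          # column subsampling
    'lambda_l1': 0.1,                 # L1 regularization
    'lambda_l2': 1.0,                 # L2 regularization
    'min_gain_to_split': 0.01,        # minimum gain required to split
    'max_depth': 6,                   # explicit depth limit
}

# Sanity check from the num_leaves guideline: stay within 2 ** max_depth.
assert overfit_params['num_leaves'] <= 2 ** overfit_params['max_depth']
print(sorted(overfit_params))
```

A dictionary like this could be passed as keyword arguments, in the same way feature_fraction is passed to LGBMClassifier earlier in this article, and then validated with cross-validation rather than trusted as is.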

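The main advantages of LightGBM:

Easy to use. It provides interfaces for the mainstream languages Python, C++, and R, so users can easily build models with LightGBM and get quite good results.
Efficient and scalable. It is fast and accurate on large datasets and has modest requirements for memory and other hardware resources.
Robust. Compared with deep learning models, it reaches comparable results without fine-grained tuning.
LightGBM supports missing values and categorical features directly, with no extra preprocessing of the data.

The main disadvantages of LightGBM:

Unlike deep learning models, it cannot model spatial or temporal position, so it does not capture high-dimensional data such as images, speech, and text well.
When massive training data is available and a suitable deep learning model can be found, deep learning can be far more accurate than LightGBM.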


Reference: https://tianchi.aliyun.com/course/278/3424

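I plan to put together systematic project courses in ML, DRL, NLP, and related fields so that beginners can pick up the material quickly. Note: some of the projects are classic ones from the web, included to help everyone learn quickly; more hands-on parts (competitions, papers, real-world applications, and so on) will be added over time.

The plan for machine learning: basic introductory machine learning algorithms -> simple hands-on projects -> data modeling competitions -> solving related real-world application problems. One path to help everyone learn and get hands-on quickly.
The plan for deep reinforcement learning: basic single-agent algorithms (mainly in gym environments) -> mainstream multi-agent algorithms (mainly in gym environments) -> single-agent and multi-agent practice (paper reproductions leaning toward business use, such as UAV scheduling optimization and power resource scheduling).
The plan for natural language processing: besides individual algorithms, it mainly centers on knowledge graph construction: information extraction techniques (including intelligent annotation) -> knowledge fusion -> knowledge reasoning -> graph applications.

What I hope you can do after mastering the above:

For ML, I hope you can do very well in mathematical modeling competitions (an award just for entering is the floor; reaching the top is still hard and takes further study).
Be able to solve real optimization and scheduling problems, rather than stopping at game demos in gym environments. (Going deeper may require your own research, and it is still quite difficult.)
Master the algorithms for each key stage of the full knowledge graph construction pipeline, including graph database knowledge.

These three areas are closely coupled. Later I will tie them together through a complete project such as a search and recommendation system, in which all of these algorithms are combined. For example, a knowledge graph uses graph algorithms, NLP, and ML algorithms, and a search and recommendation system uses reinforcement learning and knowledge graphs in addition to the field's own recall, pre-ranking, ranking, re-ranking, and mixed-ranking algorithms. It is an ambitious plan, and I will work through it step by step.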

