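Big Data Illustrated: Spark Machine Learning (Part 2), Modeling and Hyperparameter Tuning

Author: 韩信子 @ShowMeAI
Tutorial: http://www.showmeai.tech/tutorials/84
This article: http://www.showmeai.tech/article-detail/181
Notice: All rights reserved. To repost, please contact the platform and the author and credit the source.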

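Classification is an important machine learning and data mining technique. Its goal is to build, from the characteristics of a dataset, a classification function or model (often called a classifier) that maps samples of unknown class to one of a given set of classes.

Classification analyzes the input data and, from the characteristics exhibited by the training data, finds an accurate description or model for each class. That model is an explicit representation of the underlying (implicit) function.

Building a classification model generally has two phases: training and testing.

Before the model is built, the dataset is randomly split into a training set and a test set.
The training set is used to build the classification model, and the test set is then used to evaluate its classification accuracy.
If the accuracy is considered acceptable, the model can be used to classify other data tuples.

In general, the testing phase costs far less than the training phase.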
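The two phases can be sketched in plain Python. This is a minimal illustration rather than Spark code: the toy dataset, the helpers `split_dataset` and `accuracy`, and the threshold "model" are hypothetical names introduced here for the example.

```python
import random

def split_dataset(rows, train_ratio=0.7, seed=42):
    """Randomly split rows into a training list and a test list."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for x, y in rows if model(x) == y)
    return correct / len(rows)

def mean(values):
    return sum(values) / len(values)

# Toy data (an assumption for illustration): label is 1 when x >= 10
data = [(x, 1 if x >= 10 else 0) for x in range(20)]
train, test = split_dataset(data)

# "Training": put a threshold halfway between the two class means of the training set
threshold = (mean([x for x, y in train if y == 0]) +
             mean([x for x, y in train if y == 1])) / 2

def model(x):
    return 1 if x >= threshold else 0

# "Testing": measure accuracy on data the model has not seen
test_accuracy = accuracy(model, test)
print("test accuracy:", test_accuracy)
```

Testing only applies the fitted model to held-out rows, which is why it is usually much cheaper than training.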

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. Its dependent variable can be binary or multi-class.
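For the binary case, "log-linear" means the log-odds of the positive class is a linear function of the features, so the probability is the sigmoid of a weighted sum. The sketch below illustrates that relationship in plain Python; the weights, bias, and feature values are made-up numbers, and `predict_proba` is a hypothetical helper, not a Spark API.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """P(y = 1 | x) = sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical model parameters and one sample
weights = [0.8, -0.4]
bias = 0.1
p = predict_proba(weights, bias, [2.0, 1.0])

# Classify with the usual 0.5 threshold
label = 1 if p >= 0.5 else 0
print(p, label)
```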

Get the datasets and code → ShowMeAI's official GitHub https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets
Run the code and learn → online programming environment http://blog.showmeai.tech/python3-compiler

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("LogisticRegressionSummary") \
    .getOrCreate()

# Load the training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Summary of the trained model
trainingSummary = lrModel.summary

# Print the objective (loss) value at each iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# ROC curve and area under the ROC curve
trainingSummary.roc.show()
print("areaUnderROC:" + str(trainingSummary.areaUnderROC))

spark.stop()

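A support vector machine (SVM) is a binary classification model. Its basic form is the maximum-margin linear classifier defined on the feature space. SVM learning methods comprise three models: the linearly separable SVM, the linear SVM, and the nonlinear SVM.

When the training data is linearly separable, hard-margin maximization learns a linear classifier: the linearly separable SVM.
When the training data is approximately linearly separable, soft-margin maximization also learns a linear classifier: the linear SVM.
When the training data is not linearly separable, the kernel trick combined with soft-margin maximization learns a nonlinear SVM.

Linear SVMs in Spark's RDD-based MLlib API support L1- and L2-regularized variants; the DataFrame-based LinearSVC used in the code below currently supports only L2 regularization.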
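Soft-margin maximization is commonly written as minimizing the hinge loss plus a regularization term. The following plain-Python sketch shows that textbook objective with an L2 penalty; the weights, samples, and the exact scaling of the two terms are assumptions for illustration and are not taken from Spark's implementation.

```python
def decision_value(w, b, x):
    """Signed score w . x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """Hinge loss for one sample, with label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

def soft_margin_objective(w, b, samples, reg_param):
    """Average hinge loss plus an L2 penalty on the weights."""
    avg_hinge = sum(hinge_loss(w, b, x, y) for x, y in samples) / len(samples)
    l2_penalty = 0.5 * reg_param * sum(wi * wi for wi in w)
    return avg_hinge + l2_penalty

# Hypothetical weights and samples
w, b = [1.0, 0.0], 0.0
samples = [([2.0, 0.0], 1),    # outside the margin: loss 0
           ([0.5, 0.0], 1),    # inside the margin: loss 0.5
           ([-1.0, 0.0], -1)]  # on the margin: loss 0

losses = [hinge_loss(w, b, x, y) for x, y in samples]
objective = soft_margin_objective(w, b, samples, reg_param=0.1)
print(losses, objective)
```

Samples that are correctly classified with a margin of at least 1 contribute nothing; only points inside the margin or on the wrong side are penalized.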

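from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession

# Create or reuse a SparkSession
spark = SparkSession.builder.appName("LinearSVCExample").getOrCreate()

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")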

lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(training)

# Print the coefficients and intercept for linear SVC
print("Coefficients:" + str(lsvcModel.coefficients))
print("Intercept:" + str(lsvcModel.intercept))

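A decision tree is a basic method for classification and regression; this section focuses on decision trees for classification. A decision tree has a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.

During learning, the tree is built from the training data by minimizing a loss function; during prediction, new data is classified with the learned tree.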
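One way to picture the loss-minimization step is choosing the threshold for a single internal node. The sketch below assumes Gini impurity as the criterion (one common choice) and scans candidate thresholds on one feature; the helper names and the four-row dataset are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(rows, feature_index, threshold):
    """Impurity after splitting rows on features[feature_index] <= threshold."""
    left = [label for features, label in rows if features[feature_index] <= threshold]
    right = [label for features, label in rows if features[feature_index] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, feature_index):
    """Try midpoints between consecutive distinct values; keep the purest split."""
    values = sorted({features[feature_index] for features, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: weighted_gini(rows, feature_index, t))

# Toy rows: ([feature], label)
rows = [([1.0], "a"), ([2.0], "a"), ([3.0], "b"), ([4.0], "b")]
threshold = best_split(rows, 0)
print(threshold, weighted_gini(rows, 0, threshold))
```

A full tree repeats this search over all features and recurses into each side until a stopping rule is met.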

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
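from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

# Create or reuse a SparkSession
spark = SparkSession.builder.appName("DecisionTreeClassificationExample").getOrCreate()

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")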

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

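Regression is another important machine learning and data mining technique. Its goal is to build, from the characteristics of a dataset, a mapping function or model that produces a continuous-valued output for the input of an unseen sample.

Linear regression uses regression analysis from mathematical statistics to determine the quantitative relationship between two or more interdependent variables, and it is very widely used. It takes the form y = w'x + e, where e is an error term that follows a normal distribution with mean 0.

When a regression analysis involves only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple (univariate) linear regression.
When it involves two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression.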
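For the single-variable case, the least-squares line has a closed form. The sketch below is a plain-Python illustration with made-up, noise-free data; `fit_simple_linear_regression` is a hypothetical helper and is unrelated to how Spark's LinearRegression is solved internally.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 with no noise
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```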

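from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Create or reuse a SparkSession
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Load training data
training = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")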

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

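Decision trees can solve classification problems (classification trees), where the target value is categorical, and they can also be applied to regression problems (regression trees), where the output can be a continuous real value.

Example: estimating a baseball player's salary from years of experience and performance. In the accompanying figure, there are 1987 data samples covering 322 baseball players. Red and yellow indicate high income; blue and green indicate low income. The horizontal axis is years of experience and the vertical axis is performance.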
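A regression tree typically picks the split that makes the target values on each side as homogeneous as possible. The sketch below assumes squared error around each side's mean as the criterion and uses invented (years, salary) pairs; it is not the dataset from the figure, and `best_regression_split` is a hypothetical helper.

```python
def sse(values):
    """Sum of squared errors of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(points):
    """points is a list of (x, y); return (threshold, total_sse) of the best split on x."""
    xs = sorted({x for x, _ in points})
    best = None
    for a, b in zip(xs, xs[1:]):
        threshold = (a + b) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical (years of experience, salary) pairs
points = [(1, 100.0), (2, 110.0), (3, 105.0), (8, 500.0), (9, 520.0), (10, 510.0)]
threshold, total = best_regression_split(points)
print(threshold, total)
```

Each leaf then predicts the mean target of the training samples that fall into it.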

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DecisionTreeRegressionExample")\
    .getOrCreate()

# Load the data
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
# summary only
print(treeModel)

spark.stop()

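Learning the distribution of data, or the relationships among data points, from unlabeled data is called unsupervised learning.

The biggest difference between supervised and unsupervised learning is whether the data is labeled.
The most common applications of unsupervised learning are clustering and dimensionality reduction.

Clustering is an important class of machine learning methods. Its main idea is to use the feature attributes of samples to find similar samples under a given similarity measure (such as Euclidean distance) and to divide the samples into groups according to distance. Clustering is a typical unsupervised learning method.

Compared with supervised learning (such as a classifier), the training set in unsupervised learning has no human-annotated results. The data is not specially labeled, and the model is learned in order to infer some intrinsic structure of the data.

Spark's MLlib provides implementations of many clustering methods, such as K-Means, Gaussian mixture models, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), and the K-Means variants Bisecting K-Means and Streaming K-Means.

K-Means is an iterative clustering algorithm. It is a partitioning method: it first creates K partitions and then iteratively moves samples from one partition to another to improve the quality of the final clustering. The process is roughly as follows:

1. Given the value k, choose k sample points as the initial partition centers.
2. Compute the distance from every sample point to each partition center, and assign each sample point to its nearest center.
3. Compute the mean of the sample points in each partition and use it as the new center.
Repeat steps 2 and 3 until the maximum number of iterations is reached or the centers move by less than a predefined threshold.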
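The steps above can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: it simply takes the first k points as initial centers (real implementations choose them more carefully), and the six 2-D points are made up.

```python
import math

def kmeans(points, k, max_iter=20, tol=1e-6):
    # Step 1: take the first k points as initial centers (a simplification)
    centers = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Stop when no center moved by more than the threshold
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters

# Two well-separated groups of toy points
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

Because the result depends on the initial centers, library implementations usually expose a seed or an initialization mode.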

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("KMeansExample")\
        .getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train a K-Means clustering model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Predict (assign each sample to a cluster)
predictions = model.transform(dataset)

# Evaluate the clustering with the Silhouette score (ClusteringEvaluator was added in Spark 2.3)
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the predicted cluster index of each sample
print("Predicted clusters:")
for row in predictions.select("prediction").collect():
    print(row.asDict())

# Cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for center in centers:
    print(center)

spark.stop()

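Principal component analysis (PCA) is a statistical method that applies a rotation to the data. In essence it performs a change of basis in a linear space so that the variance of the data projected onto the new "axes" is maximized; the axes along which the transformed data has very little variance are then discarded. The remaining axes are called the principal components, and they represent the properties of the original data as faithfully as possible in a lower-dimensional subspace.
PCA is widely used across statistics and machine learning problems and is one of the most common dimensionality reduction methods.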
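To make the idea concrete, the sketch below finds only the first principal component of a tiny 2-D dataset. It uses power iteration on the covariance matrix, a simple technique swapped in here for illustration; it is not how Spark's PCA is implemented, and the data and helper names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return (covariance matrix, mean-centered rows) for a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and normalize to approach the top eigenvector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data lying exactly on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov, centered = covariance_matrix(rows)
component = first_principal_component(cov)

# Project each centered row onto the component: 2-D data reduced to 1-D
projected = [sum(c * v for c, v in zip(row, component)) for row in centered]
print(component, projected)
```

Here all the variance lies along the direction (1, 2), so a single coordinate per row keeps everything; with k components the same projection is done onto the top k directions.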

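from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()

# Build a small toy dataset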
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

# Reduce dimensionality with PCA (keep the top 3 principal components)
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

spark.stop()

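Model selection is a very important task in machine learning.

Using data to find the best model and parameters for a given problem is also called tuning.
Tuning can be done for an individual Estimator (such as logistic regression) or for an entire workflow (Pipeline) that includes multiple algorithms, feature engineering steps, and so on.
Users should tune the whole Pipeline at once rather than tuning each of its components separately.

MLlib supports two model selection tools: CrossValidator (cross-validation) and TrainValidationSplit (train-validation split). Both require:

Estimator: the algorithm or Pipeline to tune.
A set of ParamMaps: the parameters to choose from, also called the "parameter grid" search space.
Evaluator: the metric or method that measures how well a fitted model performs.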

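CrossValidator splits the dataset into k folds, which are used as separate training and test sets. For example:

With k = 3, CrossValidator generates 3 (training, test) dataset pairs, each using 2/3 of the data for training and 1/3 for testing.
To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric of the 3 models produced by fitting the Estimator on the 3 (training, test) dataset pairs.
After identifying the best ParamMap, CrossValidator re-fits the Estimator using that ParamMap and the entire dataset.

In other words, cross-validation finds the best ParamMap, and fitting with that ParamMap on the whole training set yields a final model that is expected to generalize well with relatively low error.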
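The fold bookkeeping can be sketched in plain Python. This is a simplified illustration: folds here are contiguous index ranges, whereas a real cross-validator assigns rows to folds randomly, and `k_fold_indices` and `cross_validate` are hypothetical helpers rather than Spark APIs.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(evaluate_fold, n_samples, k):
    """Average the metric returned by evaluate_fold(train_idx, test_idx) over all folds."""
    scores = [evaluate_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# With 9 samples and k = 3, each pair trains on 6 rows and tests on 3
folds = list(k_fold_indices(9, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))

# Dummy metric (test fraction of each fold), just to show the averaging step
average_test_fraction = cross_validate(
    lambda train, test: len(test) / (len(train) + len(test)), 9, 3)
print(average_test_fraction)
```

In a real run, `evaluate_fold` would fit the Estimator with one ParamMap on the training indices and return the Evaluator's metric on the test indices.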

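Cross-validation is expensive, so Spark also provides TrainValidationSplit for hyperparameter tuning.

TrainValidationSplit creates a single (training, test) dataset pair.
It splits the dataset into two parts using the trainRatio parameter. For example, with trainRatio=0.75, TrainValidationSplit uses 75% of the data as the training set and 25% as the validation set, and finally fits the Estimator using the best ParamMap and the entire dataset.

Whereas CrossValidator evaluates each parameter combination k times, TrainValidationSplit evaluates each combination only once.

It is therefore cheaper to evaluate.
However, its results are less reliable when the training dataset is not sufficiently large.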
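The cost difference is easy to quantify by counting model fits. The sketch below enumerates a parameter grid with `itertools.product` and compares the two strategies; `build_param_grid` and `count_fits` are hypothetical helpers, and the grid values mirror the CrossValidator example in this article.

```python
from itertools import product

def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def count_fits(num_param_maps, evaluations_per_map):
    """Fits needed for the search, plus one final refit on the full dataset."""
    return num_param_maps * evaluations_per_map + 1

param_maps = build_param_grid({
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
})

cv_fits = count_fits(len(param_maps), evaluations_per_map=3)   # 3-fold cross-validation
tvs_fits = count_fits(len(param_maps), evaluations_per_map=1)  # single train/validation pair
print(len(param_maps), cv_fits, tvs_fits)
```

With 6 combinations, 3-fold cross-validation needs 19 fits while a single train-validation split needs 7, which is the trade-off between cost and reliability described above.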

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CrossValidatorExample")\
    .getOrCreate()

# Prepare training documents, which are labeled.
training = spark.createDataFrame([(0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([(4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

spark.stop()

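1. Classification, Regression, and Clustering Models
1) Overview of classification algorithms
(1) Logistic regression
(2) Support vector machine (SVM) classifier
(3) Decision tree classifier
2) Overview of regression algorithms
(1) Linear regression
(2) Decision tree regression
3) Overview of unsupervised learning
(1) Clustering algorithms
(2) K-Means clustering
(3) Dimensionality reduction and PCA

2. Hyperparameter Tuning: Data Splitting and Grid Search
1) Machine learning workflow and hyperparameter tuning
2) Cross-validation and train-validation split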