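Author: 韩信子 @ ShowMeAI
Tutorial URL: http://www.showmeai.tech/tutorials/84
Article URL: http://www.showmeai.tech/article-detail/181
Notice: All rights reserved. For reprints, please contact the platform and the author, and credit the source.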

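1. Classification, regression, and clustering models

1) Overview of classification algorithms

Classification is an important technique in machine learning and data mining. Its goal is to construct, from the characteristics of a data set, a classification function or classification model (often called a classifier) that maps samples of unknown class to one of a given set of classes.

The purpose of classification is to analyze the input data and, from the characteristics exhibited by the data in the training set, find an accurate description or model for each class, so that the underlying (implicit) function is expressed by that model.

Building a classification model generally has two stages: training and testing.

Before building the model, the data set is randomly split into a training set and a test set.
The training set is first used to build the classification model, and the test set is then used to evaluate the model's classification accuracy.
If the accuracy is considered acceptable, the model can be used to classify other data tuples.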
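
The train-then-test procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy one-dimensional data set, the 30% test fraction, the seed, and the helper names (`train_test_split`, `fit_threshold_classifier`, `accuracy`) are all made up for this sketch and are not part of Spark.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly split rows into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold_classifier(train_rows):
    """Toy 'training': put the decision threshold halfway between the class means."""
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return lambda x: 1 if x >= threshold else 0


def accuracy(model, rows):
    """Fraction of (feature, label) rows that the model labels correctly."""
    correct = sum(1 for x, y in rows if model(x) == y)
    return correct / len(rows)


# Hypothetical data: label is 1 when the feature is at least 10.
data = [(float(x), 1 if x >= 10 else 0) for x in range(20)]

train_rows, test_rows = train_test_split(data)
model = fit_threshold_classifier(train_rows)   # training stage
test_accuracy = accuracy(model, test_rows)     # testing stage
print("test accuracy:", test_accuracy)
```

The model never sees the test rows while it is being fitted, which is what makes the test accuracy a usable estimate of how it behaves on new data.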

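Generally speaking, the testing stage costs far less than the training stage.

(1) Logistic regression

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. The dependent variable of a logistic regression can be binary or multi-class.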
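
To make "log-linear" concrete for the binary case: the model's log-odds are a linear function of the features. The following is a minimal sketch; the weights, bias, and input vector are hypothetical values chosen for illustration, not fitted parameters.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """P(y = 1 | x) for a binary logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_odds(p):
    """Logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical parameters and input (not fitted to any data).
weights = [0.8, -0.4]
bias = 0.1
x = [2.0, 1.0]

p = predict_proba(weights, bias, x)
linear_score = bias + sum(w * xi for w, xi in zip(weights, x))
print("probability:", p)
print("log-odds:", log_odds(p), "linear score:", linear_score)
```

The two printed numbers on the last line agree up to floating-point error, which is exactly the log-linear property.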

Get the data sets and code → ShowMeAI's official GitHub https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets
Run the code snippets and learn → online programming environment http://blog.showmeai.tech/python3-compiler

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("LogisticRegressionSummary") \
    .getOrCreate()

# Load the data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Summarize the model over the training set
trainingSummary = lrModel.summary

# Print the value of the objective (loss) function at each iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# ROC curve
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

spark.stop()

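(2) Support vector machine (SVM) classifier

A support vector machine (SVM) is a binary classification model. Its basic form is the linear classifier with the largest margin defined on the feature space. SVM learning methods comprise three models: the linearly separable SVM, the linear SVM, and the nonlinear SVM.

When the training data is linearly separable, a linear classifier is learned by hard-margin maximization; this is the linearly separable SVM.
When the training data is approximately linearly separable, a linear classifier is also learned, by soft-margin maximization; this is the linear SVM.
When the training data is not linearly separable, a nonlinear SVM is learned by using the kernel trick together with soft-margin maximization.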
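
One common way to write soft-margin maximization for a linear SVM is as an L2 penalty on the weights plus the average hinge loss. The sketch below assumes that formulation, labels encoded as -1 and +1, and a made-up weight vector and data set; it evaluates the objective rather than optimizing it.

```python
def margin(w, b, x, y):
    """Signed margin y * (w . x + b); at least 1 means the hard-margin constraint holds."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def hinge_loss(w, b, x, y):
    """Zero when the sample is beyond the margin, positive otherwise."""
    return max(0.0, 1.0 - margin(w, b, x, y))


def soft_margin_objective(w, b, data, reg_param):
    """L2 regularization term plus the average hinge loss (assumed formulation)."""
    reg = 0.5 * reg_param * sum(wi * wi for wi in w)
    avg_hinge = sum(hinge_loss(w, b, x, y) for x, y in data) / len(data)
    return reg + avg_hinge


# Hypothetical data: the last point sits on the wrong side of the separator.
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-2.0, -1.0], -1), ([0.2, 0.1], -1)]
w, b = [1.0, 1.0], 0.0

losses = [hinge_loss(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, reg_param=0.1)
print("hinge losses:", losses)
print("objective:", objective)
```

In the hard-margin case every hinge loss is zero; the soft margin tolerates violations such as the last point but charges for them in the objective.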

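Linear SVMs come in L1- and L2-regularized variants. In Spark, the older RDD-based MLlib implementation offers both, while the DataFrame-based LinearSVC used in this section supports only L2 regularization.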

from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("LinearSVCExample").getOrCreate()

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(training)

# Print the coefficients and intercept for linear SVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

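(3) Decision tree classifier

A decision tree is a basic method for classification and regression; this section mainly covers decision trees for classification. A decision tree has a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.

During learning, the training data is used to build the decision tree model according to the principle of minimizing a loss function; during prediction, the decision tree model is used to classify new data.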
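
As a rough illustration of "minimizing a loss" when growing a tree, the sketch below picks the threshold on a single numeric feature that minimizes the weighted Gini impurity of the two child nodes. Gini impurity is an assumption here (it is one common choice of criterion), and the data and function names are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Try midpoints between distinct values; return (threshold, weighted impurity)."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical feature values and class labels.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(values, labels)
print("split at", threshold, "with weighted Gini", impurity)
```

A full tree learner repeats this search over every feature and then recurses into each child node until a stopping rule is met.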

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("DecisionTreeClassificationExample").getOrCreate()

# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

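2) Overview of regression algorithms

Regression is also an important technique in machine learning and data mining. Its goal is to construct, from the characteristics of a data set, a mapping function or model that produces a continuous-valued output from the input of an unknown sample.

(1) Linear regression

Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative interdependence between two or more variables, and it is very widely used. It is expressed as y = w'x + e, where e is the error term, which follows a normal distribution with mean 0.

When a regression analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple (univariate) linear regression.
When a regression analysis includes two or more independent variables, and the dependent variable is linearly related to them, it is called multiple linear regression.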
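
For the single-variable case, the least-squares line has a closed form, which the following sketch computes in plain Python. The data points are hypothetical (roughly y = 2x + 1 with small errors), and `fit_simple_linear` is an illustrative helper, not a Spark function.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: about y = 2x + 1, plus a small error term e.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.0, 9.1, 10.9]

slope, intercept = fit_simple_linear(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

With more than one independent variable the same idea applies, but the solution is found with linear algebra or an iterative solver, which is what Spark's LinearRegression does at scale.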

from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Load training data
training = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

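(2) Decision tree regression

Decision tree models can solve classification problems (classification trees), where the target value is categorical, and can also be applied to regression problems (regression trees), where the output can be a continuous real value.

Example: estimate a baseball player's salary from years of experience and performance. In the original figure there are 1987 data samples, covering 322 baseball players. Red and yellow indicate high salary; blue and green indicate low salary. The horizontal axis is years of experience and the vertical axis is performance.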
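
A regression tree predicts a continuous value in each leaf, typically the mean of the training targets that fall into it. The sketch below finds a single split on one feature by minimizing the summed squared error of the two leaves. The squared-error criterion is an assumption, and the years and salary numbers are made-up toy values, not the data from the figure.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (the leaf's prediction)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_prediction, right_prediction) with minimal total SSE."""
    distinct = sorted(set(xs))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return best[1:]


# Toy data: years of experience and salary (arbitrary units).
years = [1, 2, 3, 8, 9, 10]
salary = [100.0, 120.0, 110.0, 500.0, 520.0, 510.0]

threshold, low_pred, high_pred = best_split(years, salary)
print("split at", threshold, "->", low_pred, "or", high_pred)
```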

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DecisionTreeRegressionExample")\
    .getOrCreate()

# Load the data
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, dt])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

treeModel = model.stages[1]
# summary only
print(treeModel)

spark.stop()

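3) Overview of unsupervised learning

Learning the distribution of data, or the relationships among data points, from unlabeled data is called unsupervised learning.

The biggest difference between supervised and unsupervised learning is whether the data is labeled.
The most common applications of unsupervised learning are clustering and dimensionality reduction.

(1) Clustering algorithms

Clustering is an important class of machine learning methods. Its main idea is to use the feature attributes of the samples to find similar samples under a given similarity measure (such as Euclidean distance) and to partition the samples into groups according to distance. Clustering is a typical unsupervised learning method.

Compared with supervised learning (such as classifiers), the training set in unsupervised learning has no human-annotated results. The data is not specially labeled, and the model is learned in order to infer some intrinsic structure of the data.

Spark's MLlib library implements many clustering methods, such as K-Means, Gaussian mixture models, Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), as well as the K-Means variants Bisecting K-Means and Streaming K-Means.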

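(2) K-Means clustering

K-Means is an iterative clustering algorithm. It is a partitioning method: it first creates K partitions and then iteratively moves samples from one partition to another to improve the quality of the final clustering. The procedure is roughly as follows:

1. Given k, select k sample points as the initial partition centers;
2. Compute the distance from every sample point to every partition center, and assign each sample point to its nearest center;
3. Compute the mean of the sample points in each partition and use it as the new center;
Repeat steps 2 and 3 until the maximum number of iterations is reached or the change in the partition centers falls below a predefined threshold.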
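
The steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not Spark's implementation: the random choice of initial centers, the squared-distance convergence test, the handling of empty clusters, and the toy points are all assumptions of the sketch.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=20, tol=1e-4, seed=1):
    rng = random.Random(seed)
    # Step 1: choose k sample points as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[idx].append(p)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Stop early once no center moved by more than the threshold.
        shift = max(squared_distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

On data this cleanly separated the result does not depend on which points are drawn as initial centers; on real data the initialization matters, which is why a seed is usually fixed for reproducibility.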

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("KMeansExample")\
    .getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train a K-Means clustering model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Predict (assign each sample to a cluster)
predictions = model.transform(dataset)

# Evaluate the clustering with the Silhouette score (ClusteringEvaluator was added in Spark 2.3)
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the predicted cluster of each sample
print("Predicted cluster: ")
for row in predictions.select("prediction").collect():
    print(row.asDict())

# Cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

spark.stop()

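(3) Dimensionality reduction and PCA

Principal component analysis (PCA) is a statistical method that applies a rotation to the data. In essence it performs a change of basis in a linear space so that the variance of the transformed data, projected onto the new set of "axes", is maximized. The "axes" along which the variance is very small are then dropped, and the remaining new "axes" are called principal components. They represent the properties of the original data as faithfully as possible in a lower-dimensional subspace.
PCA is widely used in statistics and machine learning problems and is one of the most common dimensionality reduction methods.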
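
To illustrate the idea of a direction of maximal variance, the sketch below finds the first principal component of a small 2-D data set with power iteration on the covariance matrix, then measures how much of the total variance the one-dimensional projection keeps. Power iteration is just one simple way to get the leading eigenvector and is chosen here as an assumption; the data and the all-ones starting vector are also made up for the example.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    return [[r[d] - means[d] for d in range(dim)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def first_principal_component(cov, iterations=200):
    """Leading eigenvector via power iteration (assumes the start vector is not orthogonal to it)."""
    dim = len(cov)
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying close to the line y = x.
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]

centered = center(rows)
cov = covariance(centered)
component = first_principal_component(cov)

# Project onto the component and compare variances.
scores = [sum(r[d] * component[d] for d in range(len(component))) for r in centered]
projected_variance = sum(s * s for s in scores) / (len(scores) - 1)
total_variance = sum(cov[d][d] for d in range(len(cov)))
explained_ratio = projected_variance / total_variance
print("component:", component)
print("explained variance ratio:", explained_ratio)
```

Because the points lie almost on a line, a single component keeps nearly all of the variance, so dropping the second axis loses very little information.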

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("PCAExample")\
    .getOrCreate()

# Build a small synthetic data set
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

# Reduce the features to 3 dimensions with PCA
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

spark.stop()

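2. Hyperparameter tuning: data splitting and grid search

1) Machine learning workflow and hyperparameter tuning

Model selection is a very important task in machine learning.

Using data to find the best model and parameters for a specific problem is also called tuning.
Tuning can be done for an individual Estimator (such as logistic regression) or for an entire workflow (Pipeline) that includes multiple algorithms, feature engineering steps, and so on.
Users should tune the whole workflow at once rather than tuning each component of the Pipeline separately.

2) Cross-validation and train-validation split

MLlib supports two model selection tools: cross-validation (CrossValidator) and train-validation split (TrainValidationSplit). Using these tools requires:

Estimator: the algorithm or Pipeline to tune.
A set of ParamMaps: the parameters to choose from, also called the "parameter grid" to search over.
Evaluator: the metric used to measure how well a fitted model performs.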
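
Conceptually, a parameter grid is the Cartesian product of the candidate values of each parameter. The sketch below builds one with plain dictionaries; `build_param_grid` is a hypothetical helper for illustration and is not Spark's ParamGridBuilder. The parameter names and values mirror the ones used in this article's CrossValidator example.

```python
from itertools import product


def build_param_grid(param_values):
    """Return one dict per combination of the candidate values."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


grid = build_param_grid({
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
})

print(len(grid), "parameter settings")
for params in grid:
    print(params)
```

Each dictionary plays the role of one ParamMap: the tuning tool fits the Estimator once per entry (per fold) and lets the Evaluator pick the best one.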

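CrossValidator splits the data set into k folds, which are used as separate training and test sets. For example:

With k=3, CrossValidator generates 3 (training, test) data set pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric of the 3 models produced by fitting the Estimator on the 3 different (training, test) data set pairs.
After identifying the best ParamMap, CrossValidator re-fits the Estimator using that ParamMap and the entire data set.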
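
The fold mechanics described above can be sketched in a few lines of plain Python. The deterministic stride-based fold assignment, the toy "model" (predict the training mean), and the mean-absolute-error metric are assumptions made to keep the example small; CrossValidator itself assigns rows to folds randomly and uses the Estimator and Evaluator you give it.

```python
def k_fold_pairs(rows, k):
    """Split rows into k folds and return k (training, test) pairs."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pairs.append((train, test))
    return pairs


def average_metric(pairs, fit, evaluate):
    """Fit on each training part, score on the matching test part, and average."""
    scores = [evaluate(fit(train), test) for train, test in pairs]
    return sum(scores) / len(scores)


def fit_mean(train):
    """Toy estimator: always predict the mean of the training values."""
    return sum(train) / len(train)


def mean_absolute_error(prediction, test):
    return sum(abs(prediction - y) for y in test) / len(test)


rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
pairs = k_fold_pairs(rows, k=3)
avg_mae = average_metric(pairs, fit_mean, mean_absolute_error)

for train, test in pairs:
    print(len(train), "training rows,", len(test), "test rows")
print("average MAE over folds:", avg_mae)
```

With 9 rows and k=3, every pair trains on 6 rows (2/3) and tests on 3 rows (1/3), and every row is used for testing exactly once.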

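In other words, cross-validation finds the best ParamMap, and fitting with that ParamMap on the entire training set yields the best model, one with strong generalization ability and relatively small error.

Cross-validation is relatively expensive, so Spark also offers TrainValidationSplit for hyperparameter tuning.

TrainValidationSplit creates a single (training, test) data set pair.
It uses the trainRatio parameter to split the data set into two parts. For example, with trainRatio=0.75, TrainValidationSplit uses 75% of the data as the training set and 25% as the validation set to form the (training, test) pair, and finally fits the Estimator using the best ParamMap and the entire data set.

Whereas CrossValidator evaluates each parameter combination k times, TrainValidationSplit evaluates each combination only once:

Its evaluation cost is therefore lower.
However, its results are less reliable when the training data set is not large enough.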
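
The cost difference is easy to quantify by counting model fits. The sketch below assumes one fit per ParamMap per (training, test) pair plus one final re-fit on the whole data set, as described above; the function names are illustrative only.

```python
def cross_validator_fits(num_param_maps, num_folds):
    """Fits performed by k-fold cross-validation, including the final re-fit."""
    return num_param_maps * num_folds + 1


def train_validation_split_fits(num_param_maps):
    """Fits performed with a single train/validation pair, including the final re-fit."""
    return num_param_maps + 1


# A grid of 3 x 2 = 6 parameter settings, evaluated with 3 folds.
cv_fits = cross_validator_fits(num_param_maps=6, num_folds=3)
tvs_fits = train_validation_split_fits(num_param_maps=6)
print("CrossValidator fits:", cv_fits)
print("TrainValidationSplit fits:", tvs_fits)
```

The gap grows linearly with the number of folds, which is the trade-off against the more reliable estimate that cross-validation provides.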

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CrossValidatorExample")\
    .getOrCreate()

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

spark.stop()

