Big Data Development: Is join a Wide or a Narrow Dependency? A Look at the cogroup Implementation


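An earlier article covered "Big Data Development: Spark Join Principles Explained". This article looks at the source code to see how join is implemented on top of cogroup.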

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName(this.getClass.getCanonicalName.init) // .init drops the trailing '$' of the object's class name
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // Plain join: neither side has a partitioner yet
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // Join after both sides were partitioned by an equal HashPartitioner
    val rdd4: RDD[(Int, (String, String))] =
      rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}

Consider the code above. What does it print? Is each of these joins a wide or a narrow dependency, and why?

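On the relationship between stage boundaries and wide or narrow dependencies: from section 2.1.3, "How to tell wide and narrow dependencies apart", we know that a stage boundary corresponds to a wide dependency. So the stage graphs of rdd3 and rdd4 are enough to tell the two cases apart. For rdd3, the join starts a new stage, so rdd3 is produced through a wide dependency. rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3))) has a different graph: after the partitionBy calls no further stage is created, so this join is a narrow dependency.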
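To make the rdd4 case more concrete, here is a minimal sketch in plain Scala (no Spark required) of why two sides that were partitioned the same way can be joined partition by partition. It assumes that a hash partitioner assigns a key to the partition "non-negative key.hashCode modulo numPartitions"; the names CoPartitionSketch, partitionOf and partitionBy are made up for this illustration and are not Spark APIs.

```scala
object CoPartitionSketch {
  // Non-negative modulo of the key's hash code: the partition index for this key.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split key-value pairs into numPartitions buckets (a stand-in for partitionBy).
  def partitionBy[V](data: Seq[(Int, V)], numPartitions: Int): Vector[Seq[(Int, V)]] =
    Vector.tabulate(numPartitions) { i =>
      data.filter { case (k, _) => partitionOf(k, numPartitions) == i }
    }

  def main(args: Array[String]): Unit = {
    val users  = Seq((0, "user1"), (4, "user2"), (7, "user3"), (4, "user4"))
    val cities = Seq((0, "BJ"), (4, "TJ"), (7, "NJ"))

    val left  = partitionBy(users, 3)
    val right = partitionBy(cities, 3)

    // Output partition i only needs partition i of each side: no data has to move.
    for (i <- 0 until 3) {
      val joined = for ((k, u) <- left(i); (k2, c) <- right(i) if k == k2) yield (k, (u, c))
      println(s"partition $i: $joined")
    }
  }
}
```

Because equal keys always land at the same partition index on both sides, each output partition depends on exactly one partition of each parent, which is what a narrow dependency means.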

[Figure: Spark UI stage graphs for rdd3 and rdd4 (image not available)]

The conclusion above was read off the UI graph. Now let's see how the join source code produces it (based on Spark 2.4.5).

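Start with the entry method. withScope can be thought of as a decorator whose job is to let the Spark UI show more information: every method that creates an RDD is wrapped in it, and RDDOperationScope records the history of RDD operations and how they relate to each other.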

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
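The decorator idea behind withScope can be sketched in a few lines of plain Scala. This is only the shape of the pattern under simplifying assumptions: ScopeSketch, its history buffer and the string scope name are hypothetical, and Spark's real RDDOperationScope is more involved.

```scala
import scala.collection.mutable.ArrayBuffer

object ScopeSketch {
  // Hypothetical stand-in for the operation history a UI could read later.
  val history: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Wrap any block: record the operation name, then run the block and return its result.
  def withScope[T](name: String)(body: => T): T = {
    history += name
    body
  }

  // An operation written the same way as the RDD methods: its whole body sits inside withScope.
  def join(left: Seq[(Int, String)], right: Seq[(Int, String)]): Seq[(Int, (String, String))] =
    withScope("join") {
      for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))
    }

  def main(args: Array[String]): Unit = {
    println(join(Seq((1, "a")), Seq((1, "x"), (2, "y"))))
    println(history)
  }
}
```

The by-name parameter (body: => T) is what lets the wrapper run bookkeeping before the wrapped code executes, without the wrapped method having to know about it.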

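Next, the implementation of defaultPartitioner. Its job is to choose between the default partition count and an existing partitioner: it reuses the existing partitioner with the most partitions when that one is eligible or larger than the default, and otherwise returns a new HashPartitioner with the default partition count.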

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    // Collect the RDDs that already have a partitioner
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    // If any RDD has a partitioner, take the RDD with the most partitions among them
    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    // If spark.default.parallelism is set, use it;
    // otherwise use the largest partition count among the input RDDs
    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than the default number of partitions, use the existing partitioner.
    // In other words: check whether an input RDD already has a partitioner, and whether that
    // partitioner is eligible or has more partitions than the default number.
    // If so, reuse that RDD's partitioner; otherwise build a HashPartitioner with the default number.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }

  private def isEligiblePartitioner(
      hasMaxPartitioner: RDD[_],
      rdds: Seq[RDD[_]]): Boolean = {
    val maxPartitions = rdds.map(_.partitions.length).max
    log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
  }
}
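The decision in defaultPartitioner is easier to follow with concrete numbers. The sketch below re-creates only the selection logic in plain Scala, under the assumption that a partitioned RDD has as many partitions as its partitioner; FakeRdd, Choice, ReuseExisting, NewHash and choose are hypothetical names for this illustration, not Spark classes.

```scala
object DefaultPartitionerSketch {
  // Hypothetical stand-in for an RDD: only its partition count and whether it has a partitioner.
  final case class FakeRdd(numPartitions: Int, hasPartitioner: Boolean)

  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice // keep an input's partitioner
  final case class NewHash(numPartitions: Int) extends Choice       // build a new HashPartitioner

  // defaultParallelism models spark.default.parallelism; None means "not set".
  def choose(rdds: Seq[FakeRdd], defaultParallelism: Option[Int]): Choice = {
    val maxPartitions = rdds.map(_.numPartitions).max
    val defaultNum = defaultParallelism.getOrElse(maxPartitions)
    val partitioned = rdds.filter(r => r.hasPartitioner && r.numPartitions > 0)

    if (partitioned.isEmpty) NewHash(defaultNum)
    else {
      val best = partitioned.maxBy(_.numPartitions)
      // Eligible: less than one order of magnitude smaller than the largest input.
      val eligible = math.log10(maxPartitions) - math.log10(best.numPartitions) < 1
      if (eligible || defaultNum < best.numPartitions) ReuseExisting(best.numPartitions)
      else NewHash(defaultNum)
    }
  }

  def main(args: Array[String]): Unit = {
    // Neither side partitioned (the rdd3 case): a new HashPartitioner is created.
    println(choose(Seq(FakeRdd(8, hasPartitioner = false), FakeRdd(8, hasPartitioner = false)), None))
    // Both sides partitioned into 3 (the rdd4 case): the existing partitioner is reused.
    println(choose(Seq(FakeRdd(3, hasPartitioner = true), FakeRdd(3, hasPartitioner = true)), None))
    // A 3-partition partitioner next to a 100-partition input is not eligible.
    println(choose(Seq(FakeRdd(3, hasPartitioner = true), FakeRdd(100, hasPartitioner = false)), None))
  }
}
```

The log10 test keeps a small existing partitioner from being reused when another input is at least ten times larger, since that would squeeze the large input into too few partitions.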

Next, step into the overloaded join method. It calls cogroup, which contains new CoGroupedRDD[K](Seq(self, other), partitioner).

  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    // partitioner is the default partitioner chosen by the comparison above;
    // what matters most is its number of partitions
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
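How an inner join falls out of cogroup can be shown on ordinary collections. The following is a sketch under the assumption that a Map of grouped values is a fair stand-in for the cogrouped RDD; CogroupJoinSketch and its two functions are illustrative names, not Spark code.

```scala
object CogroupJoinSketch {
  // Group both sides by key and keep every key that appears on either side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).map { k =>
      val vs = l.getOrElse(k, Seq.empty[(K, V)]).map(_._2)
      val ws = r.getOrElse(k, Seq.empty[(K, W)]).map(_._2)
      (k, (vs, ws))
    }.toMap
  }

  // Inner join: per key, emit the cross product of the two value groups.
  // A key missing on one side has an empty group there, so it yields nothing.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  def main(args: Array[String]): Unit = {
    val users  = Seq((0, "user1"), (0, "user2"), (9, "user3"))
    val cities = Seq((0, "BJ"), (0, "CD"), (1, "SH"))
    join(users, cities).foreach(println)
  }
}
```

Key 0 appears twice on each side and so produces four pairs, while keys 9 and 1 appear on one side only and are dropped, which is the same effect the for comprehension inside flatMapValues has in the Spark source.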


  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }

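Finally, look at CoGroupedRDD, which is where wide versus narrow is decided. For each parent RDD, if its partitioner equals the partitioner chosen above, the dependency is narrow (OneToOneDependency); otherwise it is wide (ShuffleDependency).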

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
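The rule in getDependencies can be tried out without a cluster. The sketch below is a simplified model that assumes a hash partitioner is fully described by its partition count, so two of them are equal when the counts match; HashPart, Dep, Narrow, Wide and dependencies are hypothetical names for this illustration.

```scala
object DependencySketch {
  // Simplified partitioner: equality is decided by the partition count alone.
  final case class HashPart(numPartitions: Int)

  sealed trait Dep
  case object Narrow extends Dep // one-to-one, no shuffle
  case object Wide extends Dep   // shuffle

  // One decision per parent: narrow only if the parent already uses the target partitioner.
  def dependencies(parents: Seq[Option[HashPart]], target: HashPart): Seq[Dep] =
    parents.map(p => if (p.contains(target)) Narrow else Wide)

  def main(args: Array[String]): Unit = {
    // rdd3 case: neither parent has a partitioner.
    println(dependencies(Seq(None, None), HashPart(8)))
    // rdd4 case: both parents were partitioned by an equal partitioner.
    println(dependencies(Seq(Some(HashPart(3)), Some(HashPart(3))), HashPart(3)))
    // Mixed case: only the unpartitioned side needs a shuffle.
    println(dependencies(Seq(Some(HashPart(3)), None), HashPart(3)))
  }
}
```

The mixed case shows that the decision is made per parent: pre-partitioning only one side still saves the shuffle for that side.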

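Summary: a join can be given a partition count. If the RDDs on both sides of the join already have the same partitioning scheme and partition count as the join's partitioner, no shuffle happens and the dependency is narrow; otherwise a shuffle happens and the dependency is wide. Partitioning scheme and partition count together are exactly what a partitioner expresses.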
Wu Xie, a.k.a. Xiao San Ye, a newbie working in backend, big data, and artificial intelligence.

Posted in: Big Data

2021-02-12
