背景需要
最近在我的项目中有一个场景,依据前端可视化模式传入的参数构建一组SQL语句,利用在Spark Streaming利用的数据同步中。这其实是一个已有的性能,然而发现原先的代码实现发现有较重大的问题,导致该性能在有关联查问时不可用,我通过调研之后决定从新实现。
这些SQL由一般的Lookup SQL和Spark SQL组成,Lookup SQL用于查问关联数据,SparkSQL则用于输入后果,外围问题在于如何正当组织这些表的关联关系。
PS:实现代码为Scala语言。
参数
其中前端传入的参数为
case class UpdateTask( @BeanProperty id: Option[Long], @BeanProperty taskName: Option[String], @BeanProperty taskDesc: Option[String], @BeanProperty sourceInstance: Option[String], @BeanProperty targetInstance: Option[Long], @BeanProperty eventInstance: Option[Long], @BeanProperty sourceTree: Option[Seq[Long]], @BeanProperty selectSourceTree: Option[Seq[Long]], @BeanProperty targetTree: Option[Long], @BeanProperty eventTable: Option[Long], @BeanProperty tableRelation: Option[Seq[TableRelation]], @BeanProperty filterCondition: Option[String], @BeanProperty targetCalculateTableName: Option[String], @BeanProperty targetCalculate: Option[Seq[TargetCalculate]], @BeanProperty sourceTableField: Option[Seq[TableColumnInfo]], @BeanProperty sqlType: Option[Int], @BeanProperty classicSql: Option[String], @BeanProperty sinkConfig: Option[String], @BeanProperty targetPrimaryKey: Option[Seq[String]] ) extends SimpleBaseEntity
所须要用的参数为
eventTable
: 触发表tableRelation
: 表关联关系列表,其中TableRelation
的构造为case class TableRelation(@BeanProperty leftTableSelect: Long, @BeanProperty rightTableSelect: Long, @BeanProperty leftColumnSelect: String, @BeanProperty rightColumnSelect: String)
targetCalculate
: 输入后果的计算表达式,其中TargetCalculate
的构造为case class TargetCalculate(@BeanProperty columnName: String, @BeanProperty config: String)
selectSourceTree
: 所用到的源表
解决方案
当没有关联关系的时候,比较简单,不在此探讨。当有多个关联关系时,应该先查问出被关联的表数据,再查问下一级的表,以此类推,理论场景下可能个别只有一两个表关联,然而毕竟还是须要思考极其状况,原先的实现只思考了简略的关联,简单一点的关联则无奈解决,通过一段时间思考后,决定基于树这种数据结构去实现此性能。
假如传入了如下一些表关系,并且A表为源表(触发表):
A <-> BA <-> CA <-> DB <-> EB <-> FE <-> GC <-> HC <-> I
则通过解决后,能够生成如下一个树
--> E <--> G --> B <--| | --> F |A <----> D | | --> H --> C <--| --> I
在此须要阐明,不须要思考左右程序问题,例如 A <-> B 等价于 B <-> A,在前面对此问题会有阐明。
当传入了多个雷同的表关联关系时,须要做一个聚合,因为前端的参数中,每一个关联关系只蕴含一组关联字段,所以当有多个关联字段时,就传入了多个雷同的关联关系,然而关联字段不同。
失去这个树形关系后,也同时失去了表之间的依赖关系,然而还有一个前提, 每个表只能依赖一个表,假如如下关系:
--> E <--> G --> B <--| | --> F |A <----> D | | --> H --> G <--| --> I
此时,G表既能够由A失去,又能够由E失去,假如从A表失去G表,那么从G表又能够失去E表......产生了歧义,并由此产生一个了有环图。然而咱们需要中目前没有这种关联关系(因为前端配置页面中,没有标识关联的方向性,即目前可视化模式传入的关联关系都是双向,对于一组关系,既能够从A失去B,也能够从B失去A,也就是后面的:A <-> B 等价于 B <-> A),所以不思考这种状况,呈现时给予报错,提醒依赖关系产生了环。如果有方向性的话,咱们生成树的算法会更简略一些,间接DFS即可,然而对于反复呈现的表,须要做额定解决,例如给反复表起别名,保障后果集不会呈现重名字段,否则Spark在处理过程中会产生异样。
在失去这个依赖关系后,前面的事件就好办了,咱们从根节点开始层序遍历(也即为BFS广度优先遍历),逐层构建SQL语句,也能够采纳树的先序遍历(DFS深度优先),只有保障子节点在父节点前面遍历即可,保障前面的SQL语句用到的关联参数在后面的SQL中曾经查问到。
在生成SQL的过程中,为了防止不同库表有雷同的表名或字段名,除了最初一句输入后果的Spark SQL,后面的SQL查问字段均须要起一个别名,在此沿用之前旧代码的计划:应用 {字段名} AS {库名}__{表名}__{字段名}
的模式保障字段名不会反复
代码实现
数据结构类定义
有了思路之后,便开始着手实现此性能,首先定义一个树节点的case类:
case class TableRelationTreeNode(value: Long, // 以后节点的表id parentRelation: LinkRelation, // 和父节点的关联关系 childs: ListBuffer[TableRelationTreeNode] // 子节点 )
LinkRelation
形容了两个表之间的关联关系,是对前端传入的TableRelation
聚合后的后果:
case class LinkRelation(leftTable: Long, // 左表id rightTable: Long, // 右表id linkFields: Seq[(String, String)] // 关联字段, 元组的两个参数别离为左表字段、右表字段 )
关联关系树的构建
/** * @param parentNode 父节点 * @param remainRelations 残余关联关系 */def buildRelationTree(parentNode: TableRelationTreeNode, remainRelations: ListBuffer[LinkRelation]): Any = { if (remainRelations.isEmpty) return val parentTableId = parentNode.value; // 找出关联关系中蕴含父节点的表id val childRelation = remainRelations.filter(e => e.leftTable == parentTableId || e.rightTable == parentTableId) if (childRelation.isEmpty) return // 将关联关系中父节点的关联信息置于左侧,不便后续操作 childRelation .map(e => if (e.leftTable == parentTableId) e else LinkRelation(e.rightTable, e.leftTable, e.linkFields.map(e => (e._2, e._1)))) .foreach{e => parentNode.childs += TableRelationTreeNode(e.rightTable, e, new ListBuffer())} // 移除曾经应用过的关联关系 remainRelations --= childRelation parentNode.childs foreach {buildRelationTree(_, remainRelations)}}
SQL语句生成的外围代码
def buildTransSQL(task: UpdateTask): Seq[String] = { // 存储所有用到的表(namespace为表的信息) val namespacesRef = mutable.HashMap[Long, Namespace]() task.selectSourceTree.get.foreach(i => namespacesRef += (kv = (i, Await.result(namespaceDal.findById(i), minTimeOut).get))) val targetTableId = task.targetTree.get // 指标表 val targetNamespace = Await.result(namespaceDal.findById(targetTableId), minTimeOut).head namespacesRef.put(targetTableId, targetNamespace) val eventTableId = task.eventTable.get // 事件表(源/触发表) val eventNamespace = namespacesRef(eventTableId) // 没有计算逻辑,当做镜像同步,间接SELECT * ... if (task.targetCalculate.isEmpty) return Seq.newBuilder.+=(s"spark_sql= select * from ${eventNamespace.nsTable};").result() val transSqlList = new ListBuffer[String] // 先将触发表的所有字段查问进去 transSqlList += s"spark_sql= select ${ sourceDataDal.getSourceDataTableField(eventTableId).filter(_ != "ums_active_").map(e => { s"$e AS ${eventNamespace.nsDatabase}__${eventNamespace.nsTable}__$e" }).mkString(", ") } from ${eventNamespace.nsTable}" if (task.getTableRelation.nonEmpty) { val remainLinks = new ListBuffer[LinkRelation]() // 聚合反复的表关联关系 task.getTableRelation.getOrElse(Seq.empty) .map(e => { if (e.leftTableSelect > e.rightTableSelect) { TableRelation( leftTableSelect = e.rightTableSelect, rightTableSelect = e.leftTableSelect, leftColumnSelect = e.rightColumnSelect, rightColumnSelect = e.leftColumnSelect ) } else e }) .groupBy(e => s"${e.leftTableSelect}-${e.rightTableSelect}") .map(e => { LinkRelation( leftTable = e._2.head.leftTableSelect, rightTable = e._2.head.rightTableSelect, linkFields = e._2.map(e => (e.leftColumnSelect, e.rightColumnSelect)) ) }) foreach { remainLinks += _ } // 根结点 val rootTreeNode = TableRelationTreeNode( eventTableId, null, new ListBuffer[TableRelationTreeNode] ) // 构建关系树 buildRelationTree(rootTreeNode, remainLinks) // 如果有残余的关系未被应用,则阐明有无奈连贯到根节点的关系,抛出异样 if (remainLinks.nonEmpty) { throw new IllegalArgumentException(s"游离的关联关系:${ remainLinks.map(e => { val leftNs = namespacesRef(e.leftTable) val rightNs = namespacesRef(e.rightTable) s"${leftNs.nsDatabase}.${leftNs.nsTable} <-> ${rightNs.nsDatabase}.${rightNs.nsTable}" }).toString }\n无奈与根节点(${eventNamespace.nsDatabase}.${eventNamespace.nsTable})建设关系") } val queue = new mutable.Queue[TableRelationTreeNode] queue.enqueue(rootTreeNode) // 广度优先遍历,逐层构建SQL语句,保障依赖程序 while (queue.nonEmpty) { val len = queue.size for (i <- 0 until len) { val node = queue.dequeue if (node.value != eventTableId) { val relation = node.parentRelation // 以后节点表 val curNs = namespacesRef(node.value) // 父节点表 val parNs = namespacesRef(relation.leftTable) val curTableName = s"${curNs.nsDatabase}.${curNs.nsTable}" val fields = sourceDataDal.getSourceDataTableField(node.value) val fieldAliasPrefix = s"${curNs.nsDatabase}__${curNs.nsTable}__" // 构建lookup SQL transSqlList += s"pushdown_sql left join with ${curNs.nsSys}.${curNs.nsInstance}.${curNs.nsDatabase}=select ${ fields.map(f => s"$f as $fieldAliasPrefix$f").mkString(", ") } from $curTableName where (${ relation.linkFields.map(_._2.replaceAll(".*\\.", "")).mkString(",") }) in (${relation.linkFields.map(_._1.replace(".","__")).map(e => "${" + e + "}").mkString(",")})"; } node.childs foreach { queue.enqueue(_) } } } } // 输入最终后果集的SparkSQL transSqlList += s"spark_sql= select ${ task.targetCalculate.get.map { e => s"${e.config.replaceAll("(\\w+)\\.(\\w+)\\.(\\w+)", "$1__$2__$3")} as ${e.columnName}" }.mkString(", ") } from ${eventNamespace.nsTable} where ${if (task.filterCondition.getOrElse("") == "") "1=1" else task.filterCondition.get}" transSqlList.toSeq }
测试
我新建了几张测试表,并应用小程序向库中随机生成了一些数据,而后又新建了一个指标表,以此来测试该性能,过程如下
前端配置
关联关系
计算逻辑
形象出的关联关系应为:
------> customer_transaction |customer <---> customer_account_info <---- | ------> customer_seller_relation <-----> seller_info
后盾生成的SQL:
spark_sql =select address AS adp_mock_spr_mirror__customer__address, company AS adp_mock_spr_mirror__customer__company, gender AS adp_mock_spr_mirror__customer__gender, id AS adp_mock_spr_mirror__customer__id, id_card AS adp_mock_spr_mirror__customer__id_card, mobile AS adp_mock_spr_mirror__customer__mobile, real_name AS adp_mock_spr_mirror__customer__real_name, ums_id_ AS adp_mock_spr_mirror__customer__ums_id_, ums_op_ AS adp_mock_spr_mirror__customer__ums_op_, ums_ts_ AS adp_mock_spr_mirror__customer__ums_ts_from customer;pushdown_sql left join with tidb.spr_ods_department.adp_mock_spr_mirror =select account_bank as adp_mock_spr_mirror__customer_account_info__account_bank, account_level as adp_mock_spr_mirror__customer_account_info__account_level, account_no as adp_mock_spr_mirror__customer_account_info__account_no, customer_id as adp_mock_spr_mirror__customer_account_info__customer_id, entry_time as adp_mock_spr_mirror__customer_account_info__entry_time, id as adp_mock_spr_mirror__customer_account_info__id, loc_seller as adp_mock_spr_mirror__customer_account_info__loc_seller, risk_level as adp_mock_spr_mirror__customer_account_info__risk_level, risk_test_date as adp_mock_spr_mirror__customer_account_info__risk_test_date, ums_active_ as adp_mock_spr_mirror__customer_account_info__ums_active_, ums_id_ as adp_mock_spr_mirror__customer_account_info__ums_id_, ums_op_ as adp_mock_spr_mirror__customer_account_info__ums_op_, ums_ts_ as adp_mock_spr_mirror__customer_account_info__ums_ts_from adp_mock_spr_mirror.customer_account_infowhere (id) in ($ { adp_mock_spr_mirror__customer__id });pushdown_sql left join with tidb.spr_ods_department.adp_mock_spr_mirror =select customer_id as adp_mock_spr_mirror__customer_seller_relation__customer_id, id as adp_mock_spr_mirror__customer_seller_relation__id, relation_type as adp_mock_spr_mirror__customer_seller_relation__relation_type, seller_id as adp_mock_spr_mirror__customer_seller_relation__seller_id, ums_active_ as adp_mock_spr_mirror__customer_seller_relation__ums_active_, ums_id_ as adp_mock_spr_mirror__customer_seller_relation__ums_id_, ums_op_ as adp_mock_spr_mirror__customer_seller_relation__ums_op_, ums_ts_ as adp_mock_spr_mirror__customer_seller_relation__ums_ts_, wechat_relation as adp_mock_spr_mirror__customer_seller_relation__wechat_relationfrom adp_mock_spr_mirror.customer_seller_relationwhere (customer_id) in ( $ { adp_mock_spr_mirror__customer_account_info__id } ); pushdown_sql left join with tidb.spr_ods_department.adp_mock_spr_mirror =select balance as adp_mock_spr_mirror__customer_transaction__balance, borrow_loan as adp_mock_spr_mirror__customer_transaction__borrow_loan, comment as adp_mock_spr_mirror__customer_transaction__comment, customer_account_id as adp_mock_spr_mirror__customer_transaction__customer_account_id, customer_id as adp_mock_spr_mirror__customer_transaction__customer_id, deal_abstract_code as adp_mock_spr_mirror__customer_transaction__deal_abstract_code, deal_account_type_code as adp_mock_spr_mirror__customer_transaction__deal_account_type_code, deal_code as adp_mock_spr_mirror__customer_transaction__deal_code, deal_partner_account as adp_mock_spr_mirror__customer_transaction__deal_partner_account, deal_partner_name as adp_mock_spr_mirror__customer_transaction__deal_partner_name, deal_partner_ogr_name as adp_mock_spr_mirror__customer_transaction__deal_partner_ogr_name, deal_partner_org_num as adp_mock_spr_mirror__customer_transaction__deal_partner_org_num, id as adp_mock_spr_mirror__customer_transaction__id, subject as adp_mock_spr_mirror__customer_transaction__subject, transaction_amount as adp_mock_spr_mirror__customer_transaction__transaction_amount, transaction_time as adp_mock_spr_mirror__customer_transaction__transaction_time, ums_active_ as adp_mock_spr_mirror__customer_transaction__ums_active_, ums_id_ as adp_mock_spr_mirror__customer_transaction__ums_id_, ums_op_ as adp_mock_spr_mirror__customer_transaction__ums_op_, ums_ts_ as adp_mock_spr_mirror__customer_transaction__ums_ts_from adp_mock_spr_mirror.customer_transactionwhere (customer_id, customer_account_id) in ( $ { adp_mock_spr_mirror__customer_account_info__id }, $ { adp_mock_spr_mirror__customer_account_info__account_no } );pushdown_sql left join with tidb.spr_ods_department.adp_mock_spr_mirror =select current_bank as adp_mock_spr_mirror__seller_info__current_bank, department_id as adp_mock_spr_mirror__seller_info__department_id, email as adp_mock_spr_mirror__seller_info__email, entry_time as adp_mock_spr_mirror__seller_info__entry_time, id as adp_mock_spr_mirror__seller_info__id, id_card as adp_mock_spr_mirror__seller_info__id_card, leader_id as adp_mock_spr_mirror__seller_info__leader_id, mobile as adp_mock_spr_mirror__seller_info__mobile, name as adp_mock_spr_mirror__seller_info__name, position as adp_mock_spr_mirror__seller_info__position, tenant_id as adp_mock_spr_mirror__seller_info__tenant_id, ums_active_ as adp_mock_spr_mirror__seller_info__ums_active_, ums_id_ as adp_mock_spr_mirror__seller_info__ums_id_, ums_op_ as adp_mock_spr_mirror__seller_info__ums_op_, ums_ts_ as adp_mock_spr_mirror__seller_info__ums_ts_from adp_mock_spr_mirror.seller_infowhere (id) in ( $ { adp_mock_spr_mirror__customer_seller_relation__seller_id } );spark_sql =select adp_mock_spr_mirror__customer_account_info__id as id, adp_mock_spr_mirror__customer__real_name as name, IF(adp_mock_spr_mirror__customer__gender = 0, "0", "1") as sex, adp_mock_spr_mirror__seller_info__department_id as age, adp_mock_spr_mirror__customer__mobile as phone, adp_mock_spr_mirror__seller_info__entry_time as born, adp_mock_spr_mirror__customer__address as address, IF( adp_mock_spr_mirror__customer_transaction__borrow_loan = 1, "1", "0" ) as married, NOW() as create_time, NOW() as update_time, 'P' as zodiacfrom customerwhere 1 = 1;
同步后果
从Spark后盾日志中能够看到,数据曾经失常插入指标表。
结语
以上是树和BFS在理论开发场景中的一个利用,代码实现其实较为简单,重点是实现的思路,当然解决问题的办法并不是惟一的,在此问题中,也能够在构建树的过程中间接构建SQL语句,省去后续的BFS过程,然而我思考到后续可能减少的需要,还是将此处拆成了两步,不便后续在扩大,依据理论场景抉择计划即可。另外,计算逻辑中短少字段强校验,当用户输出谬误字段时在运行期间能力察觉到,思考前期再减少此性能。
有不对的中央欢送斧正,心愿本文对大家有所帮忙。