About spark: akka-cluster split-brain-resolver (SBR)

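Background

Recently, in my project, I deployed an akka (2.6.8) cluster on k8s. I ran into a problem: if an unreachable node is never dealt with manually, no other node can join the cluster.
Concretely, the pod of one non-seed node restarted and was rescheduled onto a different machine, but the cluster kept trying to connect to the previous node (ip), which caused the failure.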

Root cause analysis

  • First, let's look at the concept of Gossip Convergence:
 Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section).    
 This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network    
 partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.

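In other words: as long as any node is unreachable, gossip convergence cannot be reached. The node has to become reachable again or be moved to the down and removed states. This only blocks the leader's membership management and does not affect the application running on top of the cluster. During a network partition, new nodes can still join, but they are not moved to the up state until the partition heals or the unreachable nodes are downed.
Clearly, akka requires every node to be either reachable or down before the cluster can agree on its state.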

The membership-lifecycle section says the same:

 If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to    
 move forward, the node must become reachable again or the node must be explicitly“downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if 
 the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.

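That is, an unreachable node must either become reachable again or be marked as downed. An unreachable state may be caused by nothing more than network jitter, or by GC pauses on a heavily loaded server. akka cannot tell these cases apart from a crash, so it can only keep trying to reconnect indefinitely.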
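The rule in the two quotes can be illustrated with a small model. This is a simplified sketch in Python, not akka's implementation: `Member` and `leader_can_act` are hypothetical names, the addresses are made up, and real convergence also requires every member to have seen the latest gossip state, which is left out here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Member:
    address: str      # made-up address, for illustration only
    status: str       # "joining", "up", "down", ...
    reachable: bool


def leader_can_act(members):
    """Simplified rule: any unreachable member that is not yet down blocks the leader."""
    return all(m.reachable or m.status == "down" for m in members)


# The pod at 10.0.0.3 was rescheduled, so its old address is unreachable.
old_pod = Member("10.0.0.3:25520", "up", reachable=False)
cluster = [
    Member("10.0.0.1:25520", "up", reachable=True),
    Member("10.0.0.2:25520", "up", reachable=True),
    old_pod,
    Member("10.0.0.4:25520", "joining", reachable=True),  # the new pod, waiting to be moved to up
]

before = leader_can_act(cluster)

# Downing the stale member unblocks the leader, so the joining node can be moved to up.
after = leader_can_act(
    [replace(m, status="down") if m == old_pod else m for m in cluster]
)

print(before, after)  # False True
```

In this model, the joining node stays stuck for as long as the stale address is neither reachable nor down, which matches the behaviour described in the background.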

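Solution

Now that the problem is clear, it has to be fixed. The official documentation covers this: unreachable nodes need to be moved to the down state. There are two ways to do that:

  • Change the state manually through an http request
  • Introduce split-brain-resolver (SBR)

The first approach is left for you to explore; we use the second one.
SBR provides five strategies: static-quorum, keep-majority, keep-oldest, down-all, and lease-majority.
We use the keep-majority strategy. For the pros, cons, and use cases of the five strategies, see the strategies section of the official documentation.
Here is the akka configuration for the keep-majority strategy: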

 akka.coordinated-shutdown.exit-jvm = on
 akka.coordinated-shutdown.exit-code = 0
 akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
 akka.cluster.split-brain-resolver.down-all-when-unstable = off
 akka.cluster.split-brain-resolver.stable-after = 20s
 akka.cluster.split-brain-resolver.active-strategy = keep-majority
 akka.cluster.split-brain-resolver.keep-majority.role = "admin"

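| Setting | Description
|---|---
|akka.coordinated-shutdown.exit-jvm| Whether to exit the jvm when the node is removed from the cluster; on or off
|akka.coordinated-shutdown.exit-code| The exit code used when the jvm exits
|akka.cluster.downing-provider-class| Set to akka.cluster.sbr.SplitBrainResolverProvider to enable SBR
|akka.cluster.split-brain-resolver.down-all-when-unstable| Whether to down all nodes when the cluster stays unstable for too long; on, off, or a duration such as 15s
|akka.cluster.split-brain-resolver.stable-after| How long a node must stay unreachable before SBR starts downing nodes
|akka.cluster.split-brain-resolver.active-strategy| The strategy to activate; keep-majority here
|akka.cluster.split-brain-resolver.keep-majority.role| When set, only nodes with this role are counted in the SBR majority decision

Note on akka.cluster.split-brain-resolver.keep-majority.role: suppose that, for whatever reason, only a minority of the cluster's nodes (fewer than half) is left. If that remaining part holds the majority of the nodes with this role, it does not shut down.
If the setting is not configured, the minority part shuts down entirely, which takes the whole cluster down.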
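
The effect of the role setting can be checked with a small model of the keep-majority decision. This is a sketch under stated assumptions, not akka's code: `keep_majority_survives` is a hypothetical function, the node names and role assignment are invented, and addresses are compared as plain strings for the tie-break (the side containing the lowest address is kept when both sides are equal in size).

```python
def keep_majority_survives(reachable, unreachable, role=None):
    """Decide, from one side's point of view, whether that side stays up.

    reachable / unreachable: dict mapping node address -> set of roles.
    role: if given, only nodes carrying this role are counted.
    """
    def counted(nodes):
        return sorted(
            addr for addr, roles in nodes.items()
            if role is None or role in roles
        )

    mine, theirs = counted(reachable), counted(unreachable)
    if len(mine) != len(theirs):
        return len(mine) > len(theirs)
    if not mine:
        # Assumption for this sketch: with no counted node on either side, do not survive.
        return False
    # Equal size: keep the side that contains the lowest address.
    return mine[0] < theirs[0]


# Hypothetical 5-node cluster, 3 of which carry the "admin" role.
# This side sees only 2 of the 5 nodes, but both are "admin".
this_side = {"node-a": {"admin"}, "node-b": {"admin"}}
other_side = {"node-c": {"admin"}, "node-d": set(), "node-e": set()}

with_role = keep_majority_survives(this_side, other_side, role="admin")
without_role = keep_majority_survives(this_side, other_side)

print(with_role, without_role)  # True False
```

With the role set, this side holds 2 of the 3 admin nodes and stays up. Without it, the same side holds 2 of 5 nodes and downs itself. If the other three nodes have actually crashed rather than being partitioned away, nothing is left running.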

For detailed explanations, see the official configuration reference.