About spark: akka-cluster split-brain-resolver (SBR)

Background

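In a recent project I deployed an Akka (2.6.8) cluster on k8s. We ran into a problem there: if an unreachable node is never dealt with manually, the remaining nodes cannot join the cluster.
Concretely, the pod of one non-seed node restarted and was rescheduled onto a different k8s node. The cluster kept trying to connect to the previous node (IP), and that caused the failure.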

Root cause analysis

First, let's look at the concept of Gossip Convergence:

 Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section).    
 This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network    
 partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.

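In short: as long as any node is unreachable, gossip convergence cannot be reached. The node has to become reachable again, or be moved to the down and removed states. This only blocks the leader's membership management; applications running on top of the cluster are not affected. During a network partition, for example, new nodes can still join, but they are not moved to the up state until the partition heals or the unreachable nodes have been downed.
Clearly, Akka needs every node to be either reachable or down before membership can converge.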

The membership-lifecycle section says the same thing:

 If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to    
 move forward, the node must become reachable again or the node must be explicitly “downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if 
 the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.

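In other words, an unreachable node has to become reachable again or be explicitly downed before the cluster can move forward. Unreachability may just as well come from a network glitch or a long GC pause on an overloaded server, and Akka cannot tell those cases apart from a crash, so it just keeps trying to reconnect.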
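
The rule described in the two quotes can be sketched in a few lines of Java. This is only an illustration of the reasoning, not Akka code: `ConvergenceSketch`, `Member`, `Status` and `leaderCanAct` are hypothetical names, and real gossip convergence involves more bookkeeping than this check.

```java
import java.util.List;

// Illustrative sketch only; these types are hypothetical, not Akka classes.
final class ConvergenceSketch {

    enum Status { JOINING, UP, DOWN }

    static final class Member {
        final String address;
        final Status status;
        final boolean reachable;

        Member(String address, Status status, boolean reachable) {
            this.address = address;
            this.status = status;
            this.reachable = reachable;
        }
    }

    // Simplified rule taken from the quoted docs: an unreachable member
    // blocks leader actions unless it has been marked DOWN.
    static boolean leaderCanAct(List<Member> members) {
        for (Member m : members) {
            if (!m.reachable && m.status != Status.DOWN) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Member seed = new Member("10.0.0.1", Status.UP, true);
        Member stale = new Member("10.0.0.2", Status.UP, false); // old pod IP
        Member joiner = new Member("10.0.0.3", Status.JOINING, true);

        // The stale IP is unreachable and not DOWN, so the joiner stays JOINING.
        System.out.println(leaderCanAct(List.of(seed, stale, joiner)));

        // Once the stale member is DOWN, the leader can act again.
        Member downed = new Member("10.0.0.2", Status.DOWN, false);
        System.out.println(leaderCanAct(List.of(seed, downed, joiner)));
    }
}
```

With the made-up addresses above, the first call models the situation from the background section (a stale pod IP that nobody downs), and the second models the state after that member has been downed.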

Solution

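Having identified the problem, we need to solve it. The official documentation gives the answer: actively move the unreachable node to the down state. There are two ways to do this:

Down the node manually through an HTTP request
Introduce the split-brain-resolver (SBR)

The first approach is left for you to explore; we take the second.
SBR provides five strategies: static-quorum, keep-majority, keep-oldest, down-all and lease-majority.
We use the keep-majority strategy. For the pros, cons and use cases of all five, see the strategies section of the official documentation.
Here is the Akka configuration for the keep-majority strategy: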

 akka.coordinated-shutdown.exit-jvm = on
 akka.coordinated-shutdown.exit-code = 0
 akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
 akka.cluster.split-brain-resolver.down-all-when-unstable = off
 akka.cluster.split-brain-resolver.stable-after = 20s
 akka.cluster.split-brain-resolver.active-strategy = keep-majority
 akka.cluster.split-brain-resolver.keep-majority.role = "admin"

For what each setting means, see the configuration section of the official documentation.
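
To make the keep-majority choice more concrete, here is a minimal Java sketch of the decision as the official strategy description states it: the side that holds the majority downs the unreachable nodes, the minority side downs itself, and on a tie the side containing the lowest address is kept. It is not Akka's implementation. `KeepMajoritySketch`, `Decision` and `decide` are hypothetical names, addresses are plain strings compared lexicographically (a simplification), and both sets are assumed to be already filtered to the configured role ("admin" in the configuration above).

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: this is not Akka's SplitBrainResolver code.
final class KeepMajoritySketch {

    enum Decision { DOWN_UNREACHABLE, DOWN_REACHABLE }

    // Decide from the point of view of one side of a partition.
    // `reachable` includes the deciding node itself, so it must not be empty.
    static Decision decide(Set<String> reachable, Set<String> unreachable) {
        if (reachable.isEmpty()) {
            throw new IllegalArgumentException("reachable must contain the deciding node");
        }
        if (reachable.size() > unreachable.size()) {
            return Decision.DOWN_UNREACHABLE; // we are the majority
        }
        if (reachable.size() < unreachable.size()) {
            return Decision.DOWN_REACHABLE;   // we are the minority: down our own side
        }
        // Tie: keep the side that contains the lowest address
        // (string order here; an assumption made for brevity).
        TreeSet<String> all = new TreeSet<>(reachable);
        all.addAll(unreachable);
        return reachable.contains(all.first())
                ? Decision.DOWN_UNREACHABLE
                : Decision.DOWN_REACHABLE;
    }

    public static void main(String[] args) {
        // Three members; one pod was rescheduled and its old IP is unreachable.
        Set<String> reachable = Set.of("10.0.0.1", "10.0.0.3");
        Set<String> unreachable = Set.of("10.0.0.2");
        System.out.println(decide(reachable, unreachable));
    }
}
```

In the scenario from the background section the surviving nodes form the majority, so under this rule they would down the stale address once the membership has been stable for the configured stable-after period (20s above), and joining nodes could then be moved to up.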
