Background
Recently, in one of my projects, I deployed an Akka (2.6.8) cluster on Kubernetes. We hit a problem: if an unreachable node is never manually restarted, the remaining nodes cannot join the cluster.
Concretely, the pod of a non-seed node was restarted and rescheduled onto a different Kubernetes node, while the cluster kept trying to reconnect to the old node at its previous IP, which caused the failure.
Root cause analysis
- First, let's look at the concept of Gossip Convergence:
Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section). This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.
Clearly, Akka requires every node to be either reachable or down before gossip convergence (the consistency agreement) can be reached.
The membership-lifecycle documentation also mentions this:
If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to move forward, the node must become reachable again or the node must be explicitly “downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.
In other words, if a node is unreachable, it must either become reachable again or be explicitly downed. An unreachable state may simply be caused by network jitter, or by a long GC pause on an overloaded server; Akka cannot tell these cases apart, so it keeps retrying the connection indefinitely.
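The rule the two quoted passages describe can be condensed into a toy model (this is only an illustrative sketch, not Akka's implementation; all names are made up): convergence is blocked as long as any member is unreachable, and explicitly downing the member unblocks it.

```python
# Toy model of the gossip-convergence rule: convergence is impossible
# while any member is "unreachable"; members must be reachable or
# explicitly downed/removed for the leader to act.
def can_converge(members):
    """members: list of (name, status) where status is one of
    'reachable', 'unreachable', 'down', 'removed'."""
    return all(status != "unreachable" for _, status in members)

cluster = [("node-a", "reachable"), ("node-b", "reachable"), ("node-c", "unreachable")]
print(can_converge(cluster))   # False: leader actions (e.g. moving joining nodes to up) are blocked

# Explicitly downing the unreachable node unblocks convergence.
cluster[2] = ("node-c", "down")
print(can_converge(cluster))   # True
```

This is exactly why the cluster in the background section stalled: the dead pod's old address stayed unreachable forever, so the leader could never admit new members.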
Solution
Now that we understand the problem, we can solve it the way the official docs suggest: automatically move unreachable nodes to the down state. There are two options:
- Trigger the state transition manually via an HTTP request
- Use the Split Brain Resolver (SBR)
The first approach is left for the reader to explore; we take the second:
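For reference, the first option is usually done through Akka Cluster HTTP Management (the `akka-management-cluster-http` module, which by default listens on port 8558). A rough sketch, assuming the management endpoint is exposed and the hostnames and member address below are hypothetical:

```shell
# List current members and their status (reachable/unreachable, up/down)
curl http://node-1:8558/cluster/members

# Manually mark an unreachable member as down
curl -X PUT -F operation=down \
  http://node-1:8558/cluster/members/akka://my-system@10.0.0.3:25520
```

The drawback is obvious: someone (or some script) has to notice the unreachable node and issue the request, which is exactly the manual intervention we want to avoid.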
SBR offers five strategies: static-quorum, keep-majority, keep-oldest, down-all, and lease-majority.
We adopt the keep-majority strategy; for the pros, cons, and use cases of all five, see the official strategies documentation.
Here is the Akka configuration for the keep-majority strategy:
```hocon
akka.coordinated-shutdown.exit-jvm = on
akka.coordinated-shutdown.exit-code = 0
akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
akka.cluster.split-brain-resolver.down-all-when-unstable = off
akka.cluster.split-brain-resolver.stable-after = 20s
akka.cluster.split-brain-resolver.active-strategy = keep-majority
akka.cluster.split-brain-resolver.keep-majority.role = "admin"
```
|Setting|Description
|---|---
|akka.coordinated-shutdown.exit-jvm|Whether to exit the JVM when the node is removed from the cluster; on or off
|akka.coordinated-shutdown.exit-code|Exit code used when the JVM exits
|akka.cluster.downing-provider-class|Set to akka.cluster.sbr.SplitBrainResolverProvider to enable SBR
|akka.cluster.split-brain-resolver.down-all-when-unstable|Down all nodes after the cluster has been unstable for this long; on, off, or a duration such as 15s
|akka.cluster.split-brain-resolver.stable-after|How long a node must remain unreachable before SBR starts downing nodes
|akka.cluster.split-brain-resolver.active-strategy|The strategy to activate; here keep-majority
|akka.cluster.split-brain-resolver.keep-majority.role|If set, only nodes with this role are counted in the SBR decision
Note on akka.cluster.split-brain-resolver.keep-majority.role: if, for some other reason, the cluster is left with only a minority of nodes (fewer than half of the cluster), and those minority nodes happen to carry this role, they will not shut themselves down.
Without this setting, the minority nodes would all exit, taking the whole cluster down.
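To make the decision concrete, here is a simplified sketch of the keep-majority rule (illustrative only; the real logic lives in akka.cluster.sbr and also accounts for member status, the oldest member, and other details). Each partition runs the decision for itself: the side seeing the majority of role-bearing nodes survives, the minority downs itself, and a tie is broken in favor of the side holding the lowest node address.

```python
# Hypothetical sketch of the keep-majority decision, not Akka's code.
def keep_majority(my_side, other_side, role=None):
    """Return True if this partition should stay up.
    my_side / other_side: lists of (address, roles) tuples."""
    def count(side):
        # Only nodes carrying the configured role take part in the vote.
        return sum(1 for _, roles in side if role is None or role in roles)
    mine, theirs = count(my_side), count(other_side)
    if mine != theirs:
        return mine > theirs
    # Equal size: the side containing the lowest address is kept,
    # so both partitions reach the same verdict independently.
    lowest = min(addr for addr, _ in my_side + other_side)
    return any(addr == lowest for addr, _ in my_side)

# A 3-vs-2 network split: the 3-node side stays, the 2-node side downs itself.
big   = [("10.0.0.1", {"admin"}), ("10.0.0.2", {"admin"}), ("10.0.0.3", {"admin"})]
small = [("10.0.0.4", {"admin"}), ("10.0.0.5", {"admin"})]
print(keep_majority(big, small, role="admin"))   # True
print(keep_majority(small, big, role="admin"))   # False
```

The sketch also shows why the role note above matters: with `role` set, only the role-bearing nodes are counted, so a surviving minority that happens to hold the role can still win the vote.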
See the official configuration documentation for the full details.