Author: 杨文 (Yang Wen)

DBA, responsible for the operation and maintenance of customer projects; familiar with a range of databases, including but not limited to MySQL, Redis, Cassandra, GreenPlum, ClickHouse, Elastic, TDSQL, and others.

Source: original contribution

*Produced by the 爱可生 open source community. Original content may not be used without authorization; to reprint, please contact the editor and credit the source.


1. Background

We know that Cassandra offers partition tolerance and high availability, with tunable (eventual) consistency. But when the host holding some data fails, what happens to the data replicas on that host? Do they become unavailable along with the host? If you want to know the answer, read on.

2. Test Environment

A cluster spanning two data centers:

Data Center  Node IPs                                                 Seed Nodes
DC1          10.186.60.61, 10.186.60.7, 10.186.60.118, 10.186.60.67   10.186.60.61, 10.186.60.7
DC2          10.186.60.53, 10.186.60.65, 10.186.60.94, 10.186.60.68   10.186.60.53, 10.186.60.65
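For reference, the seed list above would map onto each node's cassandra.yaml roughly as follows. This is a sketch under assumptions, not the article's actual config: whether all four seeds are listed on every node is a deployment choice, and the snitch is assumed to be GossipingPropertyFileSnitch (a common choice for multi-DC setups):

```yaml
# cassandra.yaml (fragment) -- same seed list on every node (assumed layout)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.186.60.61,10.186.60.7,10.186.60.53,10.186.60.65"

# the snitch that makes dc1/dc2 and rack1/rack2 visible to replication
endpoint_snitch: GossipingPropertyFileSnitch
```

With GossipingPropertyFileSnitch, each node's data center and rack (dc1/rack1 and dc2/rack2 in the transcripts below) come from its local cassandra-rackdc.properties file.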

First, let's watch how owns changes as nodes join the cluster:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.7    88.29 KiB  16      46.0%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  69.07 KiB  16      37.7%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.25 KiB  16      34.2%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.65   69.04 KiB  16      41.4%             89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   83.18 KiB  16      41.7%             7c91c707-abac-44f2-811O-b18f03f03d13  rack2

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   74.01 KiB  16      24.7%             9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    88.29 KiB  16      27.5%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  83.16 KiB  16      28.9%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.25 KiB  16      30.3%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.65   83.17 KiB  16      27.7%             89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   83.18 KiB  16      29.8%             7c91c707-abac-44f2-811O-b18f03f03d13  rack2
UN  10.186.60.94   69.05 KiB  16      31.1%             c8fa86e4-ee9a-4c62-b00b-d15edc967b9f  rack2

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   74.01 KiB  16      21.4%             9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    88.29 KiB  16      25.2%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  83.16 KiB  16      27.1%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   83.19 KiB  16      28.9%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   88.55 KiB  16      21.6%             a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   83.17 KiB  16      24.0%             89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   83.18 KiB  16      25.4%             7c91c707-abac-44f2-811O-b18f03f03d13  rack2
UN  10.186.60.94   69.05 KiB  16      26.4%             c8fa86e4-ee9a-4c62-b00b-d15edc967b9f  rack2

As you can see, while the cluster is being built out, the owns total across both data centers stays at (roughly) 200% in every snapshot, even though no single data center's owns sums to 100%.
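A quick sanity check on the numbers above, using the 7-node snapshot (values copied verbatim from the nodetool output):

```python
# Owns (effective) values from the 7-node snapshot above.
dc1 = [24.7, 27.5, 28.9, 30.3]   # 10.186.60.67 / .7 / .118 / .61
dc2 = [27.7, 29.8, 31.1]         # 10.186.60.65 / .53 / .94

# Across both data centers the percentages sum to 200%,
# as the article observes.
total = sum(dc1) + sum(dc2)
assert round(total, 1) == 200.0
```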

3. Experiments

3.1 Experiment 1

[cassandra@data01 ~]$ cqlsh 10.186.60.61 -u cassandra -p cassandra
CREATE KEYSPACE "dcdatabase" WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'dc1' : 4, 'dc2' : 4};
use dcdatabase;
create table test (id int, name varchar, primary key (id));
insert into test (id,name) VALUES (1,'test1');
insert into test (id,name) VALUES (2,'test2');
insert into test (id,name) VALUES (3,'test3');
insert into test (id,name) VALUES (4,'test4');
insert into test (id,name) VALUES (5,'test5');

Check the cluster status:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      100.0%            9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    88.29 KiB  16      100.0%            4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      100.0%            c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.37 KiB  16      100.0%            af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.23 KiB  16      100.0%            a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   83.17 KiB  16      100.0%            89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.36 KiB  16      100.0%            7c91c707-abac-44f2-811O-b18f03f03d13  rack2
UN  10.186.60.94   74.23 KiB  16      100.0%            c8fa86e4-ee9a-4c62-b00b-d15edc967b9f  rack2

As you can see, each data center's owns now totals 400% (100% on each of its four nodes), matching the four-replica setting.

Check how the data is distributed across the nodes:

[cassandra@data01 ~]$ nodetool getendpoints dcdatabase test 1
10.186.60.7
10.186.60.94
10.186.60.65
10.186.60.118
10.186.60.67
10.186.60.61
10.186.60.53
10.186.60.68
[cassandra@data03 ~]$ nodetool getendpoints dcdatabase test 5
10.186.60.67
10.186.60.94
10.186.60.7
10.186.60.53
10.186.60.65
10.186.60.118
10.186.60.61
10.186.60.68

As you can see, the data is distributed across every node in both data centers, which is exactly what the replication settings dictate (four replicas in each four-node data center).
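Why does every node hold every row here? A toy ring walk illustrates it. This is a simplified sketch, not Cassandra's actual Murmur3/vnode placement algorithm: it walks a single-token-per-node ring clockwise from the key's hash and picks the first `rf` distinct nodes in each data center, which is the essence of NetworkTopologyStrategy. When `rf` equals the data-center size, every node gets picked for every key:

```python
import hashlib

def replicas(key, ring, rf_per_dc):
    # ring: list of (token, node, dc) tuples sorted by token.
    # Hash the key onto the ring (md5 stands in for Murmur3 here).
    token = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 2**64
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= token), 0)
    chosen = {dc: [] for dc in rf_per_dc}
    # Walk clockwise, taking the first rf distinct nodes per DC.
    for i in range(len(ring)):
        _, node, dc = ring[(start + i) % len(ring)]
        if node not in chosen[dc] and len(chosen[dc]) < rf_per_dc[dc]:
            chosen[dc].append(node)
    return sorted(n for picked in chosen.values() for n in picked)

# 4 nodes per DC, one token each (the real cluster uses 16 vnodes per node).
ring = sorted(
    [(i * 2**61, f"dc1-n{i}", "dc1") for i in range(4)] +
    [(i * 2**61 + 2**60, f"dc2-n{i}", "dc2") for i in range(4)]
)

# With RF 4 in each four-node DC, every key lands on all 8 nodes,
# matching the getendpoints output above.
assert all(len(replicas(k, ring, {"dc1": 4, "dc2": 4})) == 8 for k in range(1, 6))
```

With a smaller RF (say 3 per DC, as in experiment 2 below) the same walk picks only a subset of nodes per key, which is why getendpoints then returns fewer addresses.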

Now simulate a node failure and check the data distribution afterwards:

Stop the service on the 94 host:

systemctl stop cassandra

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      100.0%            9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    88.29 KiB  16      100.0%            4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      100.0%            c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.37 KiB  16      100.0%            af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.23 KiB  16      100.0%            a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   83.17 KiB  16      100.0%            89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.36 KiB  16      100.0%            7c91c707-abac-44f2-811O-b18f03f03d13  rack2
DN  10.186.60.94   74.23 KiB  16      100.0%            c8fa86e4-ee9a-4c62-b00b-d15edc967b9f  rack2

As you can see, node 94 is down (DN), but the owns distribution in dc2 has not changed.

Check which nodes hold the data:

[cassandra@data01 ~]$ nodetool getendpoints dcdatabase test 5
10.186.60.67
10.186.60.94
10.186.60.7
10.186.60.53
10.186.60.65
10.186.60.118
10.186.60.61
10.186.60.68

As you can see, the data is still listed on node 94; getendpoints is computed from the token/replication metadata, so a node that is merely down still counts as a replica.

Remove the failed node 94 from the cluster:

[cassandra@data02 ~]$ nodetool removenode c8fa86e4-ee9a-4c62-b00b-d15edc967b9f

Check the cluster status:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      100.0%            9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    88.29 KiB  16      100.0%            4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      100.0%            c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.37 KiB  16      100.0%            af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.23 KiB  16      100.0%            a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   83.17 KiB  16      100.0%            89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.36 KiB  16      100.0%            7c91c707-abac-44f2-811O-b18f03f03d13  rack2


[cassandra@data02 ~]$ nodetool getendpoints dcdatabase test 5
10.186.60.67
10.186.60.7
10.186.60.53
10.186.60.65
10.186.60.118
10.186.60.61
10.186.60.68

As you can see, the data is no longer on node 94.

Note: when a Cassandra node is stopped or removed from the cluster, the cluster remains usable; you just cannot log in to that node's own Cassandra instance, though you can still log in to the other nodes.

3.2 Experiment 2

CREATE KEYSPACE "dcdatabase" WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};

Check the cluster status:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      73.2%             9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    89.39 KiB  16      74.7%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      77.4%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.42 KiB  16      74.7%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.22 KiB  16      100.0%            a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   84.14 KiB  16      100.0%            89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.30 KiB  16      100.0%            7c91c707-abac-44f2-811O-b18f03f03d13  rack2

As you can see, each data center's owns sums to 300%, matching the three-replica setting.
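The percentages line up with a simple rule of thumb: with NetworkTopologyStrategy, a DC stores `rf` full copies of the data spread over its nodes, so the average effective ownership per node is rf / nodes_in_dc. (A sketch; real per-node values drift a few points because vnode token ranges are not perfectly even.)

```python
# Average effective ownership per node in a data center.
def avg_owns(rf, nodes_in_dc):
    return rf / nodes_in_dc * 100

assert avg_owns(4, 4) == 100.0  # experiment 1: RF 4 over 4 nodes -> 100% each
assert avg_owns(3, 4) == 75.0   # experiment 2, dc1: observed 73.2%-77.4%
assert avg_owns(3, 3) == 100.0  # experiment 2, dc2 with 3 nodes -> 100% each
```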

Now simulate a node failure and check the data distribution afterwards:

Stop the service on the 94 host and remove it from the cluster:

[cassandra@data02 ~]$ nodetool removenode c8fa86e4-ee9a-4c62-b00b-d15edc967b9f

Check the cluster status:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      73.2%             9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    89.39 KiB  16      74.7%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      77.4%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.42 KiB  16      74.7%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.22 KiB  16      100.0%            a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   84.14 KiB  16      100.0%            89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.30 KiB  16      100.0%            7c91c707-abac-44f2-811O-b18f03f03d13  rack2

At this point the data is no longer on node 94; the failed node's replicas have been moved to other nodes. In dc1 each row is still stored on (a pseudo-random) three of the four nodes, while in dc2 the rows are now spread across the only three remaining nodes: a data migration has taken place.

If another node in dc2 fails at this point, the data on that failed node can no longer be moved to other nodes: dc1 is unaffected, its owns still totalling 300%, but every dc2 node's owns is already 100%, so there is no headroom left for failover; the surviving nodes can only keep the data they already hold.

If all hosts are then rebooted, the Cassandra service starts on every host, including the node we failed earlier, which rejoins the cluster. This produces yet another state: once the previously failed node is back in the cluster, the data is automatically redistributed.

Check the cluster status:

[cassandra@data01 ~]$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.67   96.55 KiB  16      73.2%             9d6d759b-c00c-488b-938d-3e1ef9b92b02  rack1
UN  10.186.60.7    89.39 KiB  16      74.7%             4702178e-9878-48dc-97e7-9211b7c9f2e7  rack1
UN  10.186.60.118  88.33 KiB  16      77.4%             c920c611-2e8b-472d-93a4-34f1abd5b207  rack1
UN  10.186.60.61   88.42 KiB  16      74.7%             af2e0c42-3a94-4647-9716-c484b690899i  rack1

Datacenter: dc2
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.186.60.68   74.22 KiB  16      73.2%             a7307228-62bb-4354-9853-990cac9614ab  rack2
UN  10.186.60.65   84.14 KiB  16      74.7%             89683bf8-aff8-4fdc-9525-c14764cf2d4f  rack2
UN  10.186.60.53   88.30 KiB  16      74.7%             7c91c707-abac-44f2-811O-b18f03f03d13  rack2
UN  10.186.60.94   90.12 KiB  16      77.4%             c8fa86e4-ee9a-4c62-b00b-d15edc967b9f  rack2
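With node 94 back, both data centers again have four nodes with three replicas each, so each DC's effective ownership sums back to rf × 100 = 300%. Checking against the final snapshot (values copied verbatim):

```python
# Owns (effective) values from the final snapshot, after node 94 rejoined.
dc1 = [73.2, 74.7, 77.4, 74.7]   # 10.186.60.67 / .7 / .118 / .61
dc2 = [73.2, 74.7, 74.7, 77.4]   # 10.186.60.68 / .65 / .53 / .94

# Each DC sums to 300%, i.e. three full copies of the data per DC.
assert round(sum(dc1), 1) == 300.0
assert round(sum(dc2), 1) == 300.0
```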