A First Look at OpenSearch Segment Replication

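As distributed data systems, Elasticsearch and OpenSearch both have the concepts of a primary shard and a replica shard.

A replica is a clone of its primary. Keeping replicas is what gives the system high availability and the ability to survive failures.

The process by which a replica copies data from the primary is called data replication. Its core metric is the lag between replica and primary: if the lag is always 0, replication is real-time and reliability is at its highest.

There are many data replication schemes. This article covers two of them: document-based replication (Document Replication) and file-based replication (Segment Replication).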

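Elasticsearch and OpenSearch currently use Document Replication by default. The whole process is shown in the figure below:

[Figure: Document Replication flow]

1. The client sends a write request to the primary shard node.
2. The primary shard node first writes the documents to its local translog and refreshes as needed.
3. Once the steps above succeed, the primary shard node forwards the write request to the replica shard nodes. What gets forwarded is the actual documents.
4. Each replica shard node, on receiving the write request, writes to its local translog, refreshes as needed, and reports success back to the primary shard node.
5. The primary shard node reports success to the client.
6. From then on, the primary and replica shard nodes each refresh independently according to their own settings and generate their own segment files.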

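One point to note: refresh on the primary shard and refresh on the replica shard are independent tasks. Their timing and duration differ, so the segment files actually generated and used on each side differ as well.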
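The effect of independent refreshes can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not OpenSearch code: `ShardCopy`, `replicate_write`, and the idea that a segment is just a tuple of documents are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    """Toy model of one shard copy (primary or replica); not OpenSearch code."""
    name: str
    translog: list = field(default_factory=list)  # durable operation log
    buffer: list = field(default_factory=list)    # written but not yet searchable
    segments: list = field(default_factory=list)  # each refresh emits one segment

    def index(self, doc):
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # Building a segment is the expensive step; here it is only a tuple copy.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()


def replicate_write(primary, replicas, doc):
    """Document replication: the document itself is forwarded to every replica."""
    primary.index(doc)
    for replica in replicas:
        replica.index(doc)
    return "ok"


primary, replica = ShardCopy("p"), ShardCopy("r")
for i in range(4):
    replicate_write(primary, [replica], {"id": i})
    if i % 2 == 1:
        primary.refresh()  # the primary refreshes after every second document
replica.refresh()          # the replica refreshes once, on its own schedule

print([len(s) for s in primary.segments])  # [2, 2]
print([len(s) for s in replica.segments])  # [4]
```

Both copies hold the same four documents, but the primary ends up with two segments and the replica with one, because each copy decides on its own when to refresh.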

That is the simplified Document Replication flow. If you are interested in the full flow, the link below has a more detailed description.

https://www.elastic.co/guide/…

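The most time-consuming part of a write in Elasticsearch is generating the segment files, because this involves tokenization, dictionary building and other steps that need a lot of CPU and memory.

Document Replication has to run the segment generation step on both the primary node and the replica nodes. On the replica nodes this work is effectively wasted: if it could be avoided, a good deal of CPU and memory would be saved.

The fix is simple. Once the primary node has finished, copy the generated segment files straight to the replica nodes. This scheme is Segment Replication.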

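The rough flow of Segment Replication is shown in the figure below:

[Figure: Segment Replication flow]

1. The client sends a write request to the primary shard node.
2. The primary shard node first writes the documents to its local translog and refreshes as needed and as configured. A refresh is not necessarily triggered at this point.
3. Once the steps above succeed, the primary shard node forwards the write request to the replica shard nodes. What gets forwarded is the actual documents.
4. Each replica shard node, on receiving the write request, writes to its local translog and then reports success back to the primary shard node.
5. The primary shard node reports success to the client.
6. After a refresh is triggered, the primary shard node notifies the replica shard nodes to sync the new segment files.
7. The replica shard nodes compare their local segment file list with the one on the primary shard node, then request the segment files that are missing locally or have changed.
8. The primary shard node sends the segment files requested by the replica shard nodes.
9. Once the replica shard nodes have received the segment files in full, they refresh Lucene's DirectoryReader to load the latest files, which makes the new documents searchable.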
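The file list comparison in the flow above can be sketched as a small function. This is a minimal illustration, not the OpenSearch implementation: the function name `diff_segment_files`, the example file names, and the use of a checksum to detect a changed file are all assumptions made for the example.

```python
def diff_segment_files(primary_files, replica_files):
    """Return (missing, changed) file names from the replica's point of view.

    Both arguments map file name -> checksum. Detecting changes by checksum
    is an assumption of this sketch.
    """
    missing = sorted(name for name in primary_files if name not in replica_files)
    changed = sorted(
        name for name, checksum in primary_files.items()
        if name in replica_files and replica_files[name] != checksum
    )
    return missing, changed


# Hypothetical file lists after the primary has refreshed.
primary_files = {"_0.cfs": "a1", "_1.cfs": "b2", "segments_3": "c9"}
replica_files = {"_0.cfs": "a1", "_1.cfs": "stale"}

missing, changed = diff_segment_files(primary_files, replica_files)
print(missing)  # ['segments_3']
print(changed)  # ['_1.cfs']
```

The replica would then request `missing + changed` from the primary and leave the files that already match untouched.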

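The biggest difference from Document Replication is that the replica shard nodes no longer generate segment files on their own and instead sync them directly from the primary shard node. The local translog exists only for data durability, and it can be deleted once the segment files have been synced over.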
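The replica's role can be sketched as follows. This is a toy model under stated assumptions, not OpenSearch code: the class `SegmentReplica`, its method names, and the use of a sequence number (`seq_no`, `max_seq_no`) to decide which translog entries are covered by the copied segments are all hypothetical.

```python
class SegmentReplica:
    """Toy replica under segment replication; not OpenSearch code."""

    def __init__(self):
        self.translog = []  # operations kept only for durability
        self.segments = []  # segment files copied from the primary

    def on_write(self, seq_no, doc):
        # No segment is built here: the document is only logged.
        self.translog.append((seq_no, doc))
        return "ok"

    def on_segments_received(self, segments, max_seq_no):
        # Install the copied files, then drop log entries they already cover.
        self.segments = list(segments)
        self.translog = [(s, d) for s, d in self.translog if s > max_seq_no]

    def searchable_docs(self):
        return [doc for segment in self.segments for doc in segment]


replica = SegmentReplica()
for seq_no in range(3):
    replica.on_write(seq_no, {"id": seq_no})

# Assume the primary refreshed after seq_no 1 and ships one segment covering 0..1.
replica.on_segments_received([({"id": 0}, {"id": 1})], max_seq_no=1)

print(len(replica.translog))      # 1
print(replica.searchable_docs())  # [{'id': 0}, {'id': 1}]
```

The document with `seq_no` 2 is durable on the replica but not yet searchable there, which is the short window of inconsistency that comes with this scheme.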

That is the simplified Segment Replication flow. If you are interested in the full flow, the link below has a more detailed description.

https://github.com/opensearch…

OpenSearch shipped an experimental version of Segment Replication in 2.3, so let's try it out.

This walkthrough runs on docker-compose. The file (docker-compose.yml) is as follows:

version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.3.0
    container_name: os23-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - plugins.security.disabled=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m -Dopensearch.experimental.feature.replication_type.enabled=true" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - ./os23data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.3.0
    container_name: os23-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - plugins.security.disabled=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m -Dopensearch.experimental.feature.replication_type.enabled=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - ./os23data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.3.0
    container_name: os23-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
    networks:
      - opensearch-net

networks:
  opensearch-net:

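A few notes:

Security features are disabled to keep the demo simple.
-Dopensearch.experimental.feature.replication_type.enabled=true must be added to OPENSEARCH_JAVA_OPTS to enable the segment replication feature.

Run the following command to start the OpenSearch cluster: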

docker-compose -f docker-compose.yml up

Once the cluster is up, open http://127.0.0.1:5601 to reach the Dashboards UI, then go to Dev Tools to run the steps that follow.
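If you would rather check readiness from a script before opening Dashboards, something like the following could work. It is a sketch that assumes port 9200 is published on localhost as in the compose file above; the helper names `is_ready` and `wait_for_cluster`, the timeout values, and the choice to accept a yellow status are illustrative assumptions. It uses the `_cluster/health` endpoint.

```python
import json
import time
import urllib.request


def is_ready(health, expected_nodes=2):
    """True when the health document reports all nodes joined and a non-red status."""
    return (
        health.get("number_of_nodes") == expected_nodes
        and health.get("status") in ("green", "yellow")
    )


def wait_for_cluster(base_url="http://127.0.0.1:9200", timeout_s=120, expected_nodes=2):
    """Poll _cluster/health until the cluster is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
                if is_ready(json.load(resp), expected_nodes):
                    return True
        except OSError:
            pass  # the node is not accepting connections yet
        time.sleep(2)
    return False
```

Calling `wait_for_cluster()` returns `True` once both nodes have joined, or `False` if that has not happened within the timeout.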

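The test plan is:

1. Create two indices, one with the default settings and one with segment replication enabled, each with 1 primary shard and 1 replica.
2. Insert a number of documents into both indices.
3. Compare the number and size of the segment files in the two indices.

The commands are: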

PUT /test-rep-by-doc
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

GET test-rep-by-doc/_settings

POST test-rep-by-doc/_doc
{"name": "rep by doc"}

GET _cat/shards/test-rep-by-doc?v

GET _cat/segments/test-rep-by-doc?v&h=index,shard,prirep,segment,generation,docs.count,docs.deleted,size&s=index,segment,prirep


PUT /test-rep-by-seg
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT",
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}


GET test-rep-by-seg/_settings

POST test-rep-by-seg/_doc
{"name": "rep by seg"}

GET _cat/shards/test-rep-by-seg?v

GET _cat/segments/test-rep-by-seg?v&h=index,shard,prirep,segment,generation,docs.count,docs.deleted,size&s=index,segment,prirep
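The same steps can also be driven from a script instead of Dev Tools. The sketch below assumes the cluster is reachable at http://127.0.0.1:9200 with security disabled, as in the compose file above; the helper names `index_settings`, `call`, and `run_experiment` and the default of 10 documents are illustrative choices, not part of the original walkthrough.

```python
import json
import urllib.request

CAT_COLUMNS = "index,shard,prirep,segment,generation,docs.count,docs.deleted,size"


def index_settings(segment_replication):
    """Build the create-index body; replication.type is set only for SEGMENT."""
    index = {"number_of_shards": 1, "number_of_replicas": 1}
    if segment_replication:
        index["replication.type"] = "SEGMENT"
    return {"settings": {"index": index}}


def call(method, path, body=None, base_url="http://127.0.0.1:9200"):
    """Send one JSON request and return the response body as text."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        f"{base_url}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")


def run_experiment(doc_count=10):
    """Create both indices, insert documents, and print their segment lists."""
    for name, by_segment in (("test-rep-by-doc", False), ("test-rep-by-seg", True)):
        call("PUT", f"/{name}", index_settings(by_segment))
        for i in range(doc_count):
            call("POST", f"/{name}/_doc", {"name": f"doc {i}"})
        print(call("GET", f"/_cat/segments/{name}?v&h={CAT_COLUMNS}&s=index,segment,prirep"))
```

`run_experiment()` is deliberately not called at import time. Run it once against a fresh cluster, since creating an index that already exists returns an error.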

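After inserting documents, _cat/segments returns the list of segment files, and the size column lets you compare the segment file sizes.

Here is the result for the default, document-based replication: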

index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-doc 0     p      _0               0          2            0 3.7kb
test-rep-by-doc 0     r      _0               0          1            0 3.6kb
test-rep-by-doc 0     p      _1               1          2            0 3.7kb
test-rep-by-doc 0     r      _1               1          3            0 3.8kb
test-rep-by-doc 0     p      _2               2          1            0 3.6kb
test-rep-by-doc 0     r      _2               2          3            0 3.8kb
test-rep-by-doc 0     p      _3               3          6            0 3.9kb
test-rep-by-doc 0     r      _3               3          6            0 3.9kb
test-rep-by-doc 0     p      _4               4          5            0 3.9kb
test-rep-by-doc 0     r      _4               4          6            0 3.9kb
test-rep-by-doc 0     p      _5               5          6            0 3.9kb
test-rep-by-doc 0     r      _5               5          6            0 3.9kb
test-rep-by-doc 0     p      _6               6          4            0 3.8kb
test-rep-by-doc 0     r      _6               6          1            0 3.6kb

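As you can see, although the primary shard and the replica shard have the same number of segments, their docs.count and size values differ, which shows that the underlying segment files are managed independently.

Here is the result for segment-based replication: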

index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-seg 0     p      _0               0          2            0 3.7kb
test-rep-by-seg 0     r      _0               0          2            0 3.7kb
test-rep-by-seg 0     p      _1               1          7            0   4kb
test-rep-by-seg 0     r      _1               1          7            0   4kb
test-rep-by-seg 0     p      _2               2          5            0 3.9kb
test-rep-by-seg 0     r      _2               2          5            0 3.9kb

Here the segments of the primary shard and the replica shard are exactly the same.

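You can also compare the files on disk and reach the same conclusion: Segment Replication is a file-based data replication scheme, and the primary and the replica have exactly the same list of segment files.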
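The comparison can be automated by parsing the `_cat/segments` text. The following is a sketch with hypothetical helper names (`parse_cat_segments`, `mismatched_segments`); it assumes the `?v` header row is present and that each shard has a single replica, as in this test. The sample rows are excerpts of the outputs shown above.

```python
def parse_cat_segments(text):
    """Parse `_cat/segments?v` output into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def mismatched_segments(rows):
    """Return segment names whose docs.count or size differ between p and r."""
    by_segment = {}
    for row in rows:
        key = (row["index"], row["shard"], row["segment"])
        by_segment.setdefault(key, {})[row["prirep"]] = (row["docs.count"], row["size"])
    return sorted(
        segment
        for (_, _, segment), copies in by_segment.items()
        if copies.get("p") != copies.get("r")
    )


DOC_OUTPUT = """
index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-doc 0     p      _0               0          2            0 3.7kb
test-rep-by-doc 0     r      _0               0          1            0 3.6kb
test-rep-by-doc 0     p      _3               3          6            0 3.9kb
test-rep-by-doc 0     r      _3               3          6            0 3.9kb
"""

SEG_OUTPUT = """
index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-seg 0     p      _0               0          2            0 3.7kb
test-rep-by-seg 0     r      _0               0          2            0 3.7kb
test-rep-by-seg 0     p      _1               1          7            0   4kb
test-rep-by-seg 0     r      _1               1          7            0   4kb
"""

print(mismatched_segments(parse_cat_segments(DOC_OUTPUT)))  # ['_0']
print(mismatched_segments(parse_cat_segments(SEG_OUTPUT)))  # []
```

An empty result for the segment-replicated index is what you would expect if the replica really holds copies of the primary's files.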

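According to preliminary tests by the OpenSearch community, Segment Replication compares with Document Replication as follows:

On the replica nodes, CPU and memory usage dropped by 40%~50%.
For writes, overall throughput rose by about 50% and P99 latency fell by around 20%.

These results are quite attractive, but Segment Replication also has its own limitations. A few are listed below (not necessarily accurate):

Segment Replication demands more network bandwidth. Current tests show it nearly doubling, so primary shards need to be distributed more sensibly across nodes to spread the network load.
Because of file transfer delays, the documents searchable on a replica shard may be briefly inconsistent with the primary shard.
Promoting a replica shard to primary may take longer because the translog has to be replayed, which can make the cluster unstable.

A friendly reminder: this feature is still experimental and not ready for production, so keep an eye on it as it develops.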

