Ops: Chinese Full-Text Search with RedisJSON

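RedisJSON has been getting a lot of attention online lately, so most readers have probably heard of it. One performance post even claims that RedisJSON has burst onto the scene and crushes ES and Mongo. Speedups of several hundred times may well be somewhat subjective; what I care about more is RedisJSON's JSON support, its full-text search capability, and the Chinese word segmentation it supports.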

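1. The official site offers a 30-day free trial with 30 MB of memory. Creating a single instance is enough for testing.

You can connect with redis-cli to test: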

[root@server bin]# ./redis-cli -h redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com -p 17137 -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>

2. You can also install the ReJSON module yourself.

Download: https://redis.com/redis-enter…

Installation: https://oss.redis.com/redisjs…

[root@server bin]# ./redis-server --loadmodule /opt/thunisoft/redis/redisjson/rejson.so 
82538:C 29 Dec 2021 18:41:09.585 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
82538:C 29 Dec 2021 18:41:09.585 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=82538, just started
82538:C 29 Dec 2021 18:41:09.585 # Configuration loaded
82538:M 29 Dec 2021 18:41:09.587 * monotonic clock: POSIX clock_gettime
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 6.2.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 82538
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           https://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

82538:M 29 Dec 2021 18:41:09.589 # Server initialized
82538:M 29 Dec 2021 18:41:09.589 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> version: 20006 git sha: db3329c branch: HEAD
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Exported RedisJSON_V1 API
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Enabled diskless replication
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Created new data type 'ReJSON-RL'
82538:M 29 Dec 2021 18:41:09.591 * Module 'ReJSON' loaded from /opt/thunisoft/redis/redisjson/rejson.so
82538:M 29 Dec 2021 18:41:09.602 * Loading RDB produced by version 6.2.6
82538:M 29 Dec 2021 18:41:09.602 * RDB age 98297 seconds
82538:M 29 Dec 2021 18:41:09.603 * RDB memory usage when created 0.77 Mb
82538:M 29 Dec 2021 18:41:09.603 # Done loading RDB, keys loaded: 2, keys expired: 0.
82538:M 29 Dec 2021 18:41:09.603 * DB loaded from disk: 0.011 seconds
82538:M 29 Dec 2021 18:41:09.603 * Ready to accept connections

Edit redis.conf:

/opt/thunisoft/redis/bin/redis.conf
-- add the following line
loadmodule /opt/thunisoft/redis/redisjson/rejson.so

Then restart Redis; JSON.SET is now available.

[root@server bin]# sh start.sh 
[root@server bin]# ./redis-cli   -a 123456 
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> JSON.SET jsonkey   .  '{"a":"b","c":["1","2","3"]}'
OK
127.0.0.1:6379> JSON.GET jsonkey
"{\"a\":\"b\",\"c\":[\"1\",\"2\",\"3\"]}"
127.0.0.1:6379> JSON.GET jsonkey .a
"\"b\""

127.0.0.1:6379>  JSON.SET doc . '{"a":2,"b": 3}'
OK

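JSON.SET is the command that sets a JSON value
doc is the key
. is the root of the JSON document; the string that follows is the JSON value itself
If you are using RedisJSON 2.0+, you can replace . with $, for example: JSON.SET doc $ '{"a":2, "b": 3}'

JSON.GET retrieves a JSON value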

127.0.0.1:6379>  JSON.GET doc
"{\"a\":2,\"b\":3}"
127.0.0.1:6379>  JSON.GET doc a
"2"

Retrieving values from a nested structure

127.0.0.1:6379> JSON.SET doc $ '{"a":2,"b": 3,"nested": {"a": 4,"b": null},"c":{"b":4}}'
OK
127.0.0.1:6379> JSON.GET doc b
"3"
-- $..b returns every value of b
127.0.0.1:6379> JSON.GET doc $..b
"[3,null,4]"

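As a rough illustration of what the $..b path does, the sketch below walks a parsed document recursively and collects every value stored under a given key. It is a plain-Python model of the behavior shown above, not RedisJSON's implementation, and find_all is a hypothetical helper name.

```python
import json

def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending so nested objects are searched as well.
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found

doc = json.loads('{"a":2,"b": 3,"nested": {"a": 4,"b": null},"c":{"b":4}}')
result = find_all(doc, "b")
print(json.dumps(result, separators=(",", ":")))
```

The three matches come back in document order (top-level b, nested.b, c.b), which is the same order as the reply to JSON.GET doc $..b.
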
Syntax: JSON.STRAPPEND <key> [path] <json-string>
Appends the json-string value to the string at path. If path is not provided, it defaults to root.

127.0.0.1:6379> JSON.SET doc $ '{"a":"foo","nested": {"a":"hello"},"nested2": {"a": 31}}'
OK
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foo\",\"nested\":{\"a\":\"hello\"},\"nested2\":{\"a\":31}}]"
127.0.0.1:6379> 
127.0.0.1:6379> JSON.STRAPPEND doc $..a '"baz"'
1) (integer) 6
2) (integer) 8
3) (nil)
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foobaz\",\"nested\":{\"a\":\"hellobaz\"},\"nested2\":{\"a\":31}}]"

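The reply to JSON.STRAPPEND with a $.. path has one entry per match: the new string length, or nil when the matched value is not a string. A minimal Python sketch of that bookkeeping, using a hypothetical strappend helper rather than RedisJSON's own code, could look like this:

```python
import json

def strappend(node, key, suffix, lengths):
    """Append `suffix` to every string stored under `key`, at any depth."""
    if isinstance(node, dict):
        for name in list(node):
            if name == key:
                if isinstance(node[name], str):
                    node[name] += suffix
                    lengths.append(len(node[name]))
                else:
                    lengths.append(None)  # non-string match, shown as (nil)
            strappend(node[name], key, suffix, lengths)
    elif isinstance(node, list):
        for item in node:
            strappend(item, key, suffix, lengths)
    return lengths

doc = json.loads('{"a":"foo","nested": {"a":"hello"},"nested2": {"a": 31}}')
lengths = strappend(doc, "a", "baz", [])
print(lengths)
print(json.dumps(doc, separators=(",", ":")))
```
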
127.0.0.1:6379>  JSON.SET doc $ '{"a": 1,"nested": {"a": 2,"b": 3}}'
OK
127.0.0.1:6379> JSON.get doc
"{\"a\":1,\"nested\":{\"a\":2,\"b\":3}}"
127.0.0.1:6379> 
127.0.0.1:6379> 
-- delete
127.0.0.1:6379> JSON.DEL doc $..a
(integer) 2
127.0.0.1:6379> 
127.0.0.1:6379> JSON.get doc
"{\"nested\":{\"b\":3}}"

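The integer returned by JSON.DEL is the number of values removed. The following sketch models that for a $..a style path in plain Python; delete_key is a hypothetical name, and the model only covers simple key matches like the one in the transcript.

```python
import json

def delete_key(node, key):
    """Remove `key` at every depth and return how many values were removed."""
    removed = 0
    if isinstance(node, dict):
        if key in node:
            del node[key]
            removed += 1
        # Search the remaining children for further matches.
        for child in node.values():
            removed += delete_key(child, key)
    elif isinstance(node, list):
        for item in node:
            removed += delete_key(item, key)
    return removed

doc = json.loads('{"a": 1,"nested": {"a": 2,"b": 3}}')
count = delete_key(doc, "a")
print(count, json.dumps(doc, separators=(",", ":")))
```
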
Syntax: JSON.ARRAPPEND <key> <path> <json> [json ...]

Appends the json value(s) after the last element of the array at path.

127.0.0.1:6379> JSON.SET doc $ '{"a":[1],"nested": {"a": [1,2]},"nested2": {"a": 42}}'
OK
127.0.0.1:6379> JSON.ARRAPPEND doc $..a 3 4
1) (integer) 3
2) (integer) 4
3) (nil)
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,3,4],\"nested\":{\"a\":[1,2,3,4]},\"nested2\":{\"a\":42}}]"

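A small Python model of the same call may make the reply easier to read: every value matched by $..a is visited, arrays are extended and report their new length, and anything else reports nil. The helper names values_under and arrappend are assumptions made for this sketch.

```python
import json

def values_under(node, key):
    """Yield every value stored under `key`, at any depth, in document order."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from values_under(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from values_under(item, key)

def arrappend(doc, key, *values):
    """Extend each matched array in place; return new lengths (None if not an array)."""
    lengths = []
    for target in list(values_under(doc, key)):
        if isinstance(target, list):
            target.extend(values)
            lengths.append(len(target))
        else:
            lengths.append(None)
    return lengths

doc = json.loads('{"a":[1],"nested": {"a": [1,2]},"nested2": {"a": 42}}')
lengths = arrappend(doc, "a", 3, 4)
print(lengths)
print(json.dumps(doc, separators=(",", ":")))
```
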
A JSON document can nest an array holding multiple records, much like a table

127.0.0.1:6379> JSON.SET testarray .  '{"employees":[         {"name":"Alpha","email":"alpha@gmail.com","age":23},         {"name":"Beta","email":"beta@gmail.com","age":28},       {"name":"Gamma","email":"gamma@gmail.com","age":33},         {"name":"Theta","email":"theta@gmail.com","age":41}    ]}'
OK
127.0.0.1:6379> 
127.0.0.1:6379> 
127.0.0.1:6379> 
127.0.0.1:6379> JSON.get testarray
"{\"employees\":[{\"name\":\"Alpha\",\"email\":\"alpha@gmail.com\",\"age\":23},{\"name\":\"Beta\",\"email\":\"beta@gmail.com\",\"age\":28},{\"name\":\"Gamma\",\"email\":\"gamma@gmail.com\",\"age\":33},{\"name\":\"Theta\",\"email\":\"theta@gmail.com\",\"age\":41}]}"

Syntax: JSON.ARRINSERT <key> <path> <index> <json> [json ...]

Inserts the json value(s) into the array at path, before index

127.0.0.1:6379> JSON.SET doc $ '{"a":[3],"nested": {"a": [3,4]}}'
OK
127.0.0.1:6379> JSON.ARRINSERT doc $..a 0 1 2 5
1) (integer) 4
2) (integer) 5
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,2,5,3],\"nested\":{\"a\":[1,2,5,3,4]}}]"

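The effect of inserting at index 0 can be reproduced with Python slice assignment. This is only a sketch of the index semantics for the two arrays in the transcript; arrinsert is a hypothetical helper, not part of any Redis client.

```python
def arrinsert(array, index, *values):
    """Insert `values` before position `index` and return the new array length."""
    array[index:index] = values
    return len(array)

top_level = [3]       # the array at $.a
nested = [3, 4]       # the array at $.nested.a

print(arrinsert(top_level, 0, 1, 2, 5), top_level)
print(arrinsert(nested, 0, 1, 2, 5), nested)
```
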
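Many more JSON operations are available; see: https://oss.redis.com/redisjs…

Usage documentation: https://developer.redis.com/h…

By default, Chinese text is not segmented into words; it is only split at commas. English text supports full-text search.

After some research, I found that a language, and with it the word segmentation, can be specified when the index is created: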

FT.CREATE {index}
    [ON {data_type}]
       [PREFIX {count} {prefix} [{prefix} ...]
       [FILTER {filter}]
       [LANGUAGE {default_lang}]
       [LANGUAGE_FIELD {lang_attribute}]
       [SCORE {default_score}]
       [SCORE_FIELD {score_attribute}]
       [PAYLOAD_FIELD {payload_attribute}]
    [MAXTEXTFIELDS] [TEMPORARY {seconds}] [NOOFFSETS] [NOHL] [NOFIELDS] [NOFREQS] [SKIPINITIALSCAN]
    [STOPWORDS {num} {stopword} ...]
    SCHEMA {identifier} [AS {attribute}]
        [TEXT [NOSTEM] [WEIGHT {weight}] [PHONETIC {matcher}] | NUMERIC | GEO | TAG [SEPARATOR {sep}] [CASESENSITIVE] [SORTABLE [UNF]] [NOINDEX]] |
        [VECTOR {algorithm} {count} [{attribute_name} {attribute_value} ...]] ...

Creating an index on JSON
Use ON JSON; for text fields, specify TEXT

-- create an index: i_index1
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.CREATE i_index1 ON JSON LANGUAGE chinese SCHEMA $.title TEXT 
OK
-- insert data
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   JSON.SET myDoc $ '{"title":" 云南省昆明市盘龙区 ","content":"bar1"}'
OK
-- searching for 昆明市 returns a result
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "昆明市" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
   2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"

Segmentation behavior

From the results below, searching for 云南省, 昆明市, or 盘龙区 returns the document, but searching for 昆明, 云南, 昆盘, and so on returns nothing.

redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "云南省" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
 2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "区" LANGUAGE chinese 
1) (integer) 0

redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "云南" LANGUAGE chinese 
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明市" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
 2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明" LANGUAGE chinese 
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>       FT.SEARCH i_index1 "盘龙区" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
 2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.SEARCH i_index1 "盘龙" LANGUAGE chinese 
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.SEARCH i_index1 "区" LANGUAGE chinese 
1) (integer) 0
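
One way to read these results is as exact lookups in an inverted index of whole tokens. The toy model below assumes the title was segmented into 云南省, 昆明市, and 盘龙区 (an assumption consistent with the searches above) and shows why shorter substrings miss. It is a simplified sketch, not how RediSearch stores its index; build_index and search are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, tokens in docs.items():
        for token in tokens:
            index[token].add(key)
    return index

def search(index, term):
    """Return the keys whose token list contains `term` exactly."""
    return sorted(index.get(term, set()))

# Assumed token stream for the title of myDoc.
docs = {"myDoc": ["云南省", "昆明市", "盘龙区"]}
index = build_index(docs)

for term in ["云南省", "昆明市", "盘龙区", "昆明", "盘龙", "区"]:
    print(term, len(search(index, term)))
```

Because 昆明 is never stored as a token of its own, the lookup finds nothing even though the characters appear in the title.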

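Testing 南京市长江大桥

With the title 南京市长江大桥, searching for 南京, 长江, or 大桥 returns nothing, while 南京市 and 长江大桥 each return the document. My guess is that the title was segmented into 南京市 and 长江大桥.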

redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title":" 南京市长江大桥 ","content":"bar1"}'
OK
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京市" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
   2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "长江" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "大桥" LANGUAGE chinese 
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "长江大桥" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
   2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京" LANGUAGE chinese 
1) (integer) 0
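
redis-cli prints non-ASCII bytes as \x escapes, which makes the returned titles hard to read. The sketch below converts one of the replies above back into readable text in Python; decode_cli_string is a hypothetical helper, and the input is the payload copied from the transcript without its outer quotes.

```python
import json

def decode_cli_string(printed):
    """Turn redis-cli's escaped display form back into readable text."""
    # Step 1: interpret \" and \xNN escapes, giving one character per byte.
    raw = printed.encode("ascii").decode("unicode_escape")
    # Step 2: recover the original bytes and decode them as UTF-8.
    return raw.encode("latin-1").decode("utf-8")

printed = r'{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}'
decoded = decode_cli_string(printed)
title = json.loads(decoded)["title"]
print(title)
```

Starting redis-cli with its --raw option is another way to see such replies unescaped.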

LANGUAGE chinese must be specified when creating the index.

Documentation: https://oss.redis.com/redisea…

Languages supported by full-text search:

arabic
armenian
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
serbian
spanish
swedish
tamil
turkish
yiddish
chinese (see below)

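RediSearch uses Friso for Chinese word segmentation by default.

Friso: Friso is an open-source Chinese word segmenter written in ANSI C that implements the popular mmseg algorithm. Its design and implementation are fully modular, so it can easily be embedded in other programs such as MySQL or PHP. The source compiles on a variety of platforms without modification, and it supports segmenting both UTF-8 and GBK encoded text.

I installed Friso and tested it, and it does behave the same way: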

[root@server friso-1.6.1-release]# ./src/friso -init ./friso.ini 
Initialized in 0.340000sec
Mode: Complex
+-Version: 1.6.1 (UTF-8)
+-----------------------------------------------------------+
| friso - a chinese word segmentation writen by c.          |
| bug report email - chenxin619315@gmail.com.               |
| or: visit http://code.google.com/p/friso.                 |
|     java edition for http://code.google.com/p/jcseg       |
| type 'quit' to exit the program.                          |
+-----------------------------------------------------------+
friso>> 南京市长江大桥
分词结果:
南京市 长江大桥 
Done, cost < 0.000000sec

friso>> 云南省昆明市盘龙区
分词结果:
云南省 昆明市 盘龙区 
Done, cost < 0.000000sec
friso>>

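Friso is based on the mmseg algorithm, which uses forward maximum matching as its main strategy, supplemented by several disambiguation rules.

mmseg segmentation: http://technology.chtsai.org/…

In each iteration, the algorithm scans the sentence from left to right and identifies the possible combinations of three consecutive words. It then applies the four disambiguation rules below to choose the best candidate combination, takes the first word of that combination as the result of the iteration, and runs the next round of segmentation on the text that remains, starting with the other two words. The advantage is that this adds context to traditional forward maximum matching, which considers only the word itself each time it picks one and ignores the words around it. The four disambiguation rules are: 1) the combined length of the candidate combination is the largest; 2) its average word length is the largest; 3) the variation in its word lengths is the smallest; 4) the frequency statistic of its single-character words is the highest.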
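
The procedure described above can be sketched in a few dozen lines of Python. This is a simplified model under stated assumptions, not Friso's code: DICTIONARY and CHAR_FREQ are tiny made-up tables, MAX_WORD_LEN is an arbitrary cap, and the four rules are applied as one lexicographic sort key.

```python
import math

# Hypothetical toy lexicon; a real segmenter ships a far larger dictionary.
DICTIONARY = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
# Hypothetical frequencies of single-character words, used only by rule 4.
CHAR_FREQ = {"南": 200, "京": 150, "市": 500, "长": 800, "江": 300, "大": 900, "桥": 100}
MAX_WORD_LEN = 4

def words_at(text, start):
    """Return the single character at `start` plus every dictionary word starting there."""
    words = [text[start]]
    for end in range(start + 2, min(len(text), start + MAX_WORD_LEN) + 1):
        if text[start:end] in DICTIONARY:
            words.append(text[start:end])
    return words

def chunks_at(text, start, depth=3):
    """Enumerate candidate combinations of up to `depth` consecutive words."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in words_at(text, start):
        for rest in chunks_at(text, start + len(word), depth - 1):
            result.append([word] + rest)
    return result

def rule_key(chunk):
    """Rules 1 to 4 in priority order; larger tuples are better."""
    lengths = [len(word) for word in chunk]
    total = sum(lengths)                                   # rule 1
    average = total / len(lengths)                         # rule 2
    variance = sum((n - average) ** 2 for n in lengths) / len(lengths)  # rule 3
    single = sum(math.log(CHAR_FREQ.get(w, 1)) for w in chunk if len(w) == 1)  # rule 4
    return (total, average, -variance, single)

def segment(text):
    """Emit the first word of the best chunk, then repeat from the next position."""
    tokens, pos = [], 0
    while pos < len(text):
        best = max(chunks_at(text, pos), key=rule_key)
        tokens.append(best[0])
        pos += len(best[0])
    return tokens

tokens = segment("南京市长江大桥")
print(tokens)
```

With this toy dictionary, three candidate combinations cover all seven characters, and the one with the largest average word length (南京市 followed by 长江大桥) wins under rule 2, which agrees with the Friso output shown earlier.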

SCWS segmentation is very fine-grained and covers essentially every way of splitting the phrases:

postgres=# select to_tsvector('testzhcfg','南京市长江大桥');
                                      to_tsvector                                       
----------------------------------------------------------------------------------------
 '南京':2 '南京市':1 '大':9 '大桥':6 '市':3 '桥':10 '江':8 '长':7 '长江':5 '长江大桥':4
(1 row)

postgres=#  select to_tsvector('testzhcfg','云南省昆明市盘龙区');
                                             to_tsvector                                             
-----------------------------------------------------------------------------------------------------
 '区':12 '龙':11 '云南':2 '云南省':1 '市':7 '昆':6,10 '盘龙':9 '盘龙区':8 '昆明':5 '昆明市':4 '省':3
(1 row)
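
For contrast with maximum matching, a fine-grained strategy can simply emit every dictionary word found at every position, so that shorter words such as 南京 and 大桥 also become searchable. The sketch below illustrates that idea only; it is not SCWS's algorithm, and DICTIONARY is a made-up toy lexicon.

```python
# Hypothetical toy lexicon of multi-character words.
DICTIONARY = {"南京", "南京市", "长江", "大桥", "长江大桥"}

def all_words(text, max_len=4):
    """Return every dictionary word that occurs anywhere in `text`, left to right."""
    found = []
    for start in range(len(text)):
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
    return found

tokens = all_words("南京市长江大桥")
print(tokens)
```

The trade-off is a larger index in exchange for matches on shorter query terms.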

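ES has a dedicated analysis engine and supports many analyzers; the IK analyzer is the one commonly used.

1. RedisJSON supports full-text search over JSON and uses Friso for segmentation. Friso does not segment very finely, so some two-character words cannot be found.

2. By comparison, its JSON operations are fairly complete. RedisJSON has not been out for long, so there are few real-world use cases online.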

Posted in: Ops

2021-12-31
