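RedisJSON: Chinese Full-Text Search

RedisJSON

  • RedisJSON has been getting a lot of attention online lately, so most readers will have heard of it. There is even a performance post claiming that RedisJSON has burst onto the scene and crushes ES and Mongo. Those claimed several-hundred-fold improvements may well be subjective; what I care about more is how well RedisJSON supports JSON, its full-text search capability, and which Chinese word segmentation it supports.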

Installation

1. The official site offers a 30-day free trial with 30 MB of memory; creating a single instance is enough for testing.

  • Use redis-cli to test the connection:
    [root@server bin]# ./redis-cli -h redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com -p 17137 -a 123456
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>

2. Alternatively, install the ReJSON module yourself

Download: https://redis.com/redis-enter...

Installation guide: https://oss.redis.com/redisjs...

    [root@server bin]# ./redis-server --loadmodule /opt/thunisoft/redis/redisjson/rejson.so
    82538:C 29 Dec 2021 18:41:09.585 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
    82538:C 29 Dec 2021 18:41:09.585 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=82538, just started
    82538:C 29 Dec 2021 18:41:09.585 # Configuration loaded
    82538:M 29 Dec 2021 18:41:09.587 * monotonic clock: POSIX clock_gettime
                    _._
               _.-``__ ''-._
          _.-``    `.  `_.  ''-._           Redis 6.2.6 (00000000/0) 64 bit
      .-`` .-```.  ```\/    _.,_ ''-._
     (    '      ,       .-`  | `,    )     Running in standalone mode
     |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
     |    `-._   `._    /     _.-'    |     PID: 82538
      `-._    `-._  `-./  _.-'    _.-'
     |`-._`-._    `-.__.-'    _.-'_.-'|
     |    `-._`-._        _.-'_.-'    |           https://redis.io
      `-._    `-._`-.__.-'_.-'    _.-'
     |`-._`-._    `-.__.-'    _.-'_.-'|
     |    `-._`-._        _.-'_.-'    |
      `-._    `-._`-.__.-'_.-'    _.-'
          `-._    `-.__.-'    _.-'
              `-._        _.-'
                  `-.__.-'

    82538:M 29 Dec 2021 18:41:09.589 # Server initialized
    82538:M 29 Dec 2021 18:41:09.589 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> version: 20006 git sha: db3329c branch: HEAD
    82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Exported RedisJSON_V1 API
    82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Enabled diskless replication
    82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Created new data type 'ReJSON-RL'
    82538:M 29 Dec 2021 18:41:09.591 * Module 'ReJSON' loaded from /opt/thunisoft/redis/redisjson/rejson.so
    82538:M 29 Dec 2021 18:41:09.602 * Loading RDB produced by version 6.2.6
    82538:M 29 Dec 2021 18:41:09.602 * RDB age 98297 seconds
    82538:M 29 Dec 2021 18:41:09.603 * RDB memory usage when created 0.77 Mb
    82538:M 29 Dec 2021 18:41:09.603 # Done loading RDB, keys loaded: 2, keys expired: 0.
    82538:M 29 Dec 2021 18:41:09.603 * DB loaded from disk: 0.011 seconds
    82538:M 29 Dec 2021 18:41:09.603 * Ready to accept connections

Modify redis.conf

Add the following line to /opt/thunisoft/redis/bin/redis.conf:

    loadmodule /opt/thunisoft/redis/redisjson/rejson.so

Then restart Redis; JSON.SET is now available.

    [root@server bin]# sh start.sh
    [root@server bin]# ./redis-cli -a 123456
    Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
    127.0.0.1:6379> JSON.SET jsonkey . '{"a":"b","c":["1","2","3"]}'
    OK
    127.0.0.1:6379> JSON.GET jsonkey
    "{\"a\":\"b\",\"c\":[\"1\",\"2\",\"3\"]}"
    127.0.0.1:6379> JSON.GET jsonkey .a
    "\"b\""

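Using JSON

JSON.SET

    127.0.0.1:6379> JSON.SET doc . '{"a":2, "b": 3}'
    OK
  • JSON.SET is the command that sets a JSON value
  • doc is the key
  • . is the root of the JSON document; the string that follows is the actual JSON data
  • With RedisJSON 2.0+, you can replace . with $: JSON.SET doc $ '{"a":2, "b": 3}'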

JSON.GET

  • JSON.GET retrieves a JSON value
    127.0.0.1:6379> JSON.GET doc
    "{\"a\":2,\"b\":3}"
    127.0.0.1:6379> JSON.GET doc a
    "2"
  • Retrieving values from a nested structure
    127.0.0.1:6379> JSON.SET doc $ '{"a":2, "b": 3, "nested": {"a": 4, "b": null},"c":{"b":4}}'
    OK
    127.0.0.1:6379> JSON.GET doc b
    "3"
    -- $..b returns the values of every b
    127.0.0.1:6379> JSON.GET doc $..b
    "[3,null,4]"
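The recursive path $..b can be pictured as a walk over the whole document that collects every value stored under the key b, in document order. The sketch below mimics that behavior in plain Python on the same document. find_all is a hypothetical helper written for illustration; it is not part of RedisJSON or of any client library, and it only models this one kind of path.

```python
import json

def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: nested objects may hold the same key again.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found

doc = json.loads('{"a":2, "b": 3, "nested": {"a": 4, "b": null}, "c": {"b": 4}}')
result = find_all(doc, "b")
print(json.dumps(result))  # [3, null, 4]
```

The three matches are the top-level b, nested.b, and c.b, which lines up with the "[3,null,4]" reply in the transcript.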

JSON.STRAPPEND

  • JSON.STRAPPEND <key> [path] <json-string>
  • Appends the json-string value to the string at path. If path is not provided, it defaults to the root.
    127.0.0.1:6379> JSON.SET doc $ '{"a":"foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}'
    OK
    127.0.0.1:6379> JSON.GET doc $
    "[{\"a\":\"foo\",\"nested\":{\"a\":\"hello\"},\"nested2\":{\"a\":31}}]"
    127.0.0.1:6379> JSON.STRAPPEND doc $..a '"baz"'
    1) (integer) 6
    2) (integer) 8
    3) (nil)
    127.0.0.1:6379> JSON.GET doc $
    "[{\"a\":\"foobaz\",\"nested\":{\"a\":\"hellobaz\"},\"nested2\":{\"a\":31}}]"
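The reply to JSON.STRAPPEND in the transcript has one entry per value matched by $..a: the new string length where the value is a string, and nil where it is not (the 31 under nested2). A rough Python model of that behavior is sketched below. strappend_all is a hypothetical helper used only for illustration; it handles only the "same key at any depth" case and says nothing about how RedisJSON is implemented.

```python
def strappend_all(node, key, suffix):
    """Append `suffix` to each string stored under `key`; return one result per match."""
    results = []
    if isinstance(node, dict):
        for k in list(node):
            if k == key:
                if isinstance(node[k], str):
                    node[k] += suffix
                    results.append(len(node[k]))  # new length of the string
                else:
                    results.append(None)          # non-string match: nil
            results.extend(strappend_all(node[k], key, suffix))
    elif isinstance(node, list):
        for item in node:
            results.extend(strappend_all(item, key, suffix))
    return results

doc = {"a": "foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}
lengths = strappend_all(doc, "a", "baz")
print(lengths)  # [6, 8, None]
print(doc)      # {'a': 'foobaz', 'nested': {'a': 'hellobaz'}, 'nested2': {'a': 31}}
```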

JSON.DEL

    127.0.0.1:6379> JSON.SET doc $ '{"a": 1, "nested": {"a": 2, "b": 3}}'
    OK
    127.0.0.1:6379> JSON.get doc
    "{\"a\":1,\"nested\":{\"a\":2,\"b\":3}}"
    -- delete every a
    127.0.0.1:6379> JSON.DEL doc $..a
    (integer) 2
    127.0.0.1:6379> JSON.get doc
    "{\"nested\":{\"b\":3}}"

JSON.ARRAPPEND

Syntax: JSON.ARRAPPEND <key> <path> <json> [json ...]

Appends the json value(s) after the last element of the array at path.

    127.0.0.1:6379> JSON.SET doc $ '{"a":[1], "nested": {"a": [1,2]}, "nested2": {"a": 42}}'
    OK
    127.0.0.1:6379> JSON.ARRAPPEND doc $..a 3 4
    1) (integer) 3
    2) (integer) 4
    3) (nil)
    127.0.0.1:6379> JSON.GET doc $
    "[{\"a\":[1,3,4],\"nested\":{\"a\":[1,2,3,4]},\"nested2\":{\"a\":42}}]"
  • A JSON document can nest an array that holds multiple records, much like a table

    127.0.0.1:6379> JSON.SET testarray . '{"employees":[{"name":"Alpha", "email":"alpha@gmail.com", "age":23}, {"name":"Beta", "email":"beta@gmail.com", "age":28}, {"name":"Gamma", "email":"gamma@gmail.com", "age":33}, {"name":"Theta", "email":"theta@gmail.com", "age":41}]}'
    OK
    127.0.0.1:6379> JSON.get testarray
    "{\"employees\":[{\"name\":\"Alpha\",\"email\":\"alpha@gmail.com\",\"age\":23},{\"name\":\"Beta\",\"email\":\"beta@gmail.com\",\"age\":28},{\"name\":\"Gamma\",\"email\":\"gamma@gmail.com\",\"age\":33},{\"name\":\"Theta\",\"email\":\"theta@gmail.com\",\"age\":41}]}"

JSON.ARRINSERT

Syntax: JSON.ARRINSERT <key> <path> <index> <json> [json ...]

Inserts the value(s) into the array at path, before the given index.

    127.0.0.1:6379> JSON.SET doc $ '{"a":[3], "nested": {"a": [3,4]}}'
    OK
    127.0.0.1:6379> JSON.ARRINSERT doc $..a 0 1 2 5
    1) (integer) 4
    2) (integer) 5
    127.0.0.1:6379> JSON.GET doc $
    "[{\"a\":[1,2,5,3],\"nested\":{\"a\":[1,2,5,3,4]}}]"

There are many more JSON commands; see: https://oss.redis.com/redisjs...

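JSON full-text search

Documentation: https://developer.redis.com/h...

By default, Chinese text is not segmented into words; it is only split at commas. Full-text search is supported for English.

Further reading showed that a language for word segmentation can be specified when the index is created: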

    FT.CREATE {index}
        [ON {data_type}]
           [PREFIX {count} {prefix} [{prefix} ...]
           [FILTER {filter}]
           [LANGUAGE {default_lang}]
           [LANGUAGE_FIELD {lang_attribute}]
           [SCORE {default_score}]
           [SCORE_FIELD {score_attribute}]
           [PAYLOAD_FIELD {payload_attribute}]
        [MAXTEXTFIELDS] [TEMPORARY {seconds}] [NOOFFSETS] [NOHL] [NOFIELDS] [NOFREQS] [SKIPINITIALSCAN]
        [STOPWORDS {num} {stopword} ...]
        SCHEMA {identifier} [AS {attribute}]
            [TEXT [NOSTEM] [WEIGHT {weight}] [PHONETIC {matcher}] | NUMERIC | GEO | TAG [SEPARATOR {sep}] [CASESENSITIVE] [SORTABLE [UNF]] [NOINDEX]] |
            [VECTOR {algorithm} {count} [{attribute_name} {attribute_value} ...]] ...
  • Creating an index on JSON
  • Use ON JSON; for a text field, specify TEXT
    -- create a new index: i_index1
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.CREATE i_index1 ON JSON LANGUAGE chinese SCHEMA $.title TEXT
    OK
    -- insert data
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title": "云南省昆明市盘龙区", "content": "bar1"}'
    OK
    -- search for 昆明市: the document is found
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明市" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
  • How the text is segmented
  • As the results below show, searching for 云南省, 昆明市, or 盘龙区 each finds the document, but searching for 昆明, 云南, 盘龙, and so on finds nothing.

    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "云南省" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "区" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "云南" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明市" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "盘龙区" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "盘龙" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "区" LANGUAGE chinese
    1) (integer) 0
  • Testing 南京市长江大桥
  • With 南京市长江大桥 stored, searching for 南京, 长江, or 大桥 returns nothing, while searching for 南京市 or 长江大桥 returns the document. My guess is that the text was segmented as 南京市 / 长江大桥.

    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title": "南京市长江大桥", "content": "bar1"}'
    OK
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京市" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "长江" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "大桥" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "长江大桥" LANGUAGE chinese
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京" LANGUAGE chinese
    1) (integer) 0
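One way to read these results is with a toy inverted index in which a document can only be found through the exact tokens produced at indexing time. The Python sketch below assumes the segmentation guessed above (南京市 / 长江大桥) and plain exact-term lookup. It is a simplified model for illustration, not how RediSearch stores or queries its index, and build_index and search are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, tokens in docs.items():
        for token in tokens:
            index[token].add(key)
    return index

def search(index, term):
    """Exact-term lookup: a term matches only if it was indexed as a token."""
    return sorted(index.get(term, set()))

# Assumed tokenization of the title, taken from the guess in the text.
docs = {"myDoc": ["南京市", "长江大桥"]}
index = build_index(docs)

for term in ["南京市", "长江大桥", "南京", "长江", "大桥"]:
    print(term, search(index, term))
```

Under this model 南京市 and 长江大桥 return myDoc, while 南京, 长江, and 大桥 return an empty list because none of them was ever stored as a token, which is the same pattern the FT.SEARCH transcript shows.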

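LANGUAGE chinese must be specified when the index is created.

RediSearch documentation: https://oss.redis.com/redisea...

  • Languages supported for full-text search:
    arabic, armenian, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish, chinese (see below)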

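RediSearch uses Friso by default for Chinese word segmentation.

Friso: Friso is an open-source Chinese word segmenter written in ANSI C that implements the popular mmseg algorithm. It is designed and built in a fully modular way, so it can easily be embedded in other programs such as MySQL or PHP. The source compiles and runs on a variety of platforms without modification, and it supports segmenting both UTF-8 and GBK encoded text.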

Friso segmentation

  • I installed Friso and tested it; the segmentation is indeed what I had guessed
    [root@server friso-1.6.1-release]# ./src/friso -init ./friso.ini
    Initialized in 0.340000sec
    Mode: Complex
    +-Version: 1.6.1 (UTF-8)
    +-----------------------------------------------------------+
    | friso - a chinese word segmentation writen by c.          |
    | bug report email - chenxin619315@gmail.com.               |
    | or: visit http://code.google.com/p/friso.                 |
    |     java edition for http://code.google.com/p/jcseg       |
    | type 'quit' to exit the program.                          |
    +-----------------------------------------------------------+
    friso>> 南京市长江大桥
    分词结果:
    南京市 长江大桥
    Done, cost < 0.000000sec
    friso>> 云南省昆明市盘龙区
    分词结果:
    云南省 昆明市 盘龙区
    Done, cost < 0.000000sec
    friso>>

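Friso is based on the mmseg algorithm: forward maximum matching at its core, supplemented by several disambiguation rules.

mmseg segmentation: http://technology.chtsai.org/...

Working left to right through a complete sentence, each iteration identifies the different possible combinations of 3 consecutive words, then applies the following 4 disambiguation rules to choose the best candidate combination. The first word of that combination is emitted as the result of the iteration, and the remaining 2 words are carried into the next round of segmentation. The benefit is that this adds context to traditional forward maximum matching, which picks each word by looking only at the word itself and ignores the neighboring words. The 4 disambiguation rules are: 1) the total length of the candidate combination is largest; 2) the average word length of the combination is largest; 3) the variation in word length within the combination is smallest; 4) the frequency statistic of the single-character words in the combination is highest.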
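The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Friso's code: DICT and CHAR_FREQ are made-up toy data, and the function names are hypothetical. The four rules are applied in order through a tuple sort key. Rules 2 and 3 are expressed with integers: among chunks with the same total length, the largest average word length means the fewest words, and among chunks with the same total length and word count, the smallest variance means the smallest sum of squared lengths.

```python
import math

# Hypothetical mini-dictionary and single-character frequencies, for illustration only.
DICT = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥",
        "云南", "云南省", "昆明", "昆明市", "盘龙", "盘龙区"}
CHAR_FREQ = {"市": 500, "长": 800, "江": 300, "区": 400}
MAX_WORD_LEN = max(len(w) for w in DICT)

def candidates(text, pos):
    """Words that can start at pos: any dictionary word, or the single character."""
    return [text[pos:pos + n]
            for n in range(1, min(MAX_WORD_LEN, len(text) - pos) + 1)
            if n == 1 or text[pos:pos + n] in DICT]

def chunks(text, pos, depth=3):
    """Every combination of up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    return [[word] + rest
            for word in candidates(text, pos)
            for rest in chunks(text, pos + len(word), depth - 1)]

def rank(chunk):
    """Sort key that applies the 4 rules in order (larger is better)."""
    lengths = [len(w) for w in chunk]
    single_char_score = sum(math.log(CHAR_FREQ.get(w, 1))
                            for w in chunk if len(w) == 1)
    return (
        sum(lengths),                  # rule 1: largest total length
        -len(chunk),                   # rule 2: largest average word length
        -sum(n * n for n in lengths),  # rule 3: smallest variation in word length
        single_char_score,             # rule 4: single-character word frequency
    )

def segment(text):
    """Emit the first word of the best chunk, then continue after that word."""
    words, pos = [], 0
    while pos < len(text):
        first = max(chunks(text, pos), key=rank)[0]
        words.append(first)
        pos += len(first)
    return words

print(segment("南京市长江大桥"))      # ['南京市', '长江大桥']
print(segment("云南省昆明市盘龙区"))  # ['云南省', '昆明市', '盘龙区']
```

With this toy dictionary, 南京市长江大桥 has three chunks that cover all 7 characters (南京市/长江大桥, 南京市/长江/大桥, 南京/市/长江大桥), and rule 2 picks the two-word one, so 南京 and 长江 never appear as tokens.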

Comparison with segmentation in the abase database (SCWS)

  • SCWS segments very finely, covering essentially every possible word split
    postgres=# select to_tsvector('testzhcfg','南京市长江大桥');
                                          to_tsvector
    ----------------------------------------------------------------------------------------
     '南京':2 '南京市':1 '大':9 '大桥':6 '市':3 '桥':10 '江':8 '长':7 '长江':5 '长江大桥':4
    (1 row)

    postgres=# select to_tsvector('testzhcfg','云南省昆明市盘龙区');
                                                 to_tsvector
    -----------------------------------------------------------------------------------------------------
     '区':12 '龙':11 '云南':2 '云南省':1 '市':7 '昆':6,10 '盘龙':9 '盘龙区':8 '昆明':5 '昆明市':4 '省':3
    (1 row)

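ES

ES has a dedicated analysis engine and supports multiple analyzers; the IK analyzer is the one commonly used.

Summary

1. RedisJSON supports full-text search over JSON and uses Friso for segmentation. Friso does not segment very finely, so some two-character words cannot be found by a search.

2. By comparison, its JSON operations are fairly comprehensive. RedisJSON has not been out for long, so there are few real-world usage examples online.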
